Abstract

An outstanding challenge in many problems throughout science and engineering is
to succinctly characterize the relationships among a large number of interacting entities.
Models based on graphs form one major thrust in this thesis, as graphs often
provide a concise representation of the interactions among a large set of variables. A
second major emphasis of this thesis is on classes of structured models that satisfy certain
algebraic constraints. The common theme underlying these approaches is the development
of computational methods based on convex optimization, which are in turn useful
in a broad array of problems in signal processing and machine learning. The specific
contributions are as follows:

•  We  propose  a  convex  optimization  method  for  decomposing  the  sum  of  a  sparse 
matrix  and  a  low-rank  matrix  into  the  individual  components.  Based  on  new 
rank-sparsity  uncertainty  principles,  we  give  conditions  under  which  the  convex 
program  exactly  recovers  the  underlying  components. 

•  Building on the previous point, we describe a convex optimization approach to
latent variable Gaussian graphical model selection. We provide theoretical guarantees
of the statistical consistency of this convex program in the high-dimensional
scaling regime in which the number of latent/observed variables grows with the
number of samples of the observed variables. The algebraic varieties of sparse and
low-rank matrices play a prominent role in this analysis.

•  We  present  a  general  convex  optimization  formulation  for  linear  inverse  problems, 
in  which  we  have  limited  measurements  in  the  form  of  linear  functionals  of  a  signal 
or  model  of  interest.  When  these  underlying  models  have  algebraic  structure,  the 
resulting convex programs can be solved exactly or approximately via semidefinite
programming.  We  provide  sharp  estimates  (based  on  computing  certain  Gaussian 
statistics  related  to  the  underlying  model  geometry)  of  the  number  of  generic 
linear  measurements  required  for  exact  and  robust  recovery  in  a  variety  of  settings. 

•  We present convex graph invariants, which are invariants of a graph that are convex
functions of the underlying adjacency matrix. Graph invariants characterize
structural properties of a graph that do not depend on the labeling of the nodes;
convex graph invariants constitute an important subclass, and they provide a systematic
and unified computational framework based on convex optimization for
solving a number of interesting graph problems.

We  emphasize  a  unified  view  of  the  underlying  convex  geometry  common  to  these 
different  frameworks.  We  describe  applications  of  these  methods  to  problems  in  financial 
modeling  and  network  analysis,  and  conclude  with  a  discussion  of  directions  for  future 
research. 
Chapter 1


Introduction


An  outstanding  challenge  in  many  applications  throughout  science  and  engineering  is  to 
succinctly  characterize  the  relationships  among  a  large  number  of  interacting  entities. 
In  a  statistical  model  selection  setting  we  wish  to  learn  a  “simple”  statistical  model  to 
approximate  the  behavior  observed  in  a  collection  of  random  variables.  Modern  data 
analysis tasks in geophysics, economics, and image processing often involve learning statistical
models over collections of random variables that may number in the hundreds of
thousands, or even a few million. In a computational biology setting a typical question
involving gene regulatory networks is to discover the interaction patterns among a collection
of genes in order to better understand how a gene influences or is influenced by
other genes. Similar problems also arise in the analysis of biological, social, or chemical
reaction networks in which one seeks to better understand a complicated network by
decomposing it into simpler networks. Models based on graphs offer a fruitful framework
to solve such problems, as graphs often provide a concise representation of the
interactions  among  a  large  set  of  variables. 

In  this  thesis  we  explore  a  set  of  research  directions  at  the  intersection  of  graphs 
and  statistics.  An  important  instance  of  a  framework  that  lies  in  this  intersection  is 
that of graphical models, in which a statistical model is defined with respect to a graph.
Another example is one in which we have statistical models over the space of graphs, so
that  a  graph  itself  is  viewed  as  a  sample  drawn  from  a  probability  distribution  defined 
over  some  set  of  graphs.  Natural  questions  that  arise  in  standard  statistical  settings 
such  as  deconvolution  can  then  be  posed  in  a  deterministic  framework  in  this  graph 
setting  as  well. 

A  common  theme  underlying  our  investigations  is  the  development  of  tractable 
computational  tools  based  on  convex  optimization,  which  possess  numerous  favorable 
properties.  Due  to  their  powerful  modeling  capabilities,  convex  optimization  methods 
can  provide  tractable  formulations  for  solving  difficult  combinatorial  problems  exactly  or 
approximately. Further, convex programs may often be solved effectively using general-purpose
off-the-shelf software. Finally, one can also give conditions for the success of
these  convex  relaxations  based  on  standard  optimality  results  from  convex  analysis. 

■  1.1  Main  Contributions 

In  this  section  we  outline  the  main  contributions  of  this  thesis.  Details  about  related 
previous  work  are  given  in  the  relevant  chapters.  The  research  and  results  of  Chapters  3, 
4,  5,  and  6  correspond  to  the  papers  [37],  [33],  [36],  and  [34]  respectively. 

Rank-Sparsity  Uncertainty  Principles  and  Matrix  Decomposition 

Suppose  we  are  given  a  matrix  that  is  formed  by  adding  an  unknown  sparse  matrix  to 
an  unknown  low-rank  matrix.  The  goal  is  to  decompose  the  given  matrix  into  its  sparse 
and  low-rank  components.  Such  a  problem  is  intractable  to  solve  in  general,  and  arises 
in  a  number  of  applications  such  as  model  selection  in  statistics,  system  identification 
in control, optical system decomposition, and matrix rigidity in computer science. Indeed,
sparse-plus-low-rank matrix decomposition is the main challenge in latent-variable
Gaussian graphical model selection, which is discussed next (and in greater detail in
Chapter 4). In Chapter 3, we propose a convex optimization formulation for splitting the
specified matrix into its components, by minimizing a linear combination of the $\ell_1$ norm
and the nuclear norm (the sum of the singular values of a matrix) of the components.
We develop a notion of rank-sparsity incoherence, expressed as an uncertainty principle
between the sparsity pattern of a matrix and its row and column spaces, and use it to
characterize both fundamental identifiability as well as (deterministic) sufficient conditions
for exact recovery. The analysis is geometric in nature with the tangent spaces to
the  algebraic  varieties  of  sparse  and  low-rank  matrices  playing  a  prominent  role. 

Latent  Variable  Gaussian  Graphical  Model  Selection 

Graphical  models  are  widely  used  in  many  applications  throughout  machine  learning, 
computational  biology,  statistical  signal  processing,  and  statistical  physics  as  they  offer  a 
compact  representation  for  the  statistical  structure  among  a  large  collection  of  random 
variables.  Graphical  models  in  which  the  underlying  graph  is  sparse  typically  tend 
to  be  better  suited  for  efficiently  performing  tasks  such  as  inference  and  estimation. 
In the setting of Gaussian graphical models where the random variables are jointly
Gaussian,  sparsity  in  the  graph  structure  corresponds  to  sparsity  in  the  inverse  of  the 
covariance  matrix  of  the  random  variables,  also  called  the  concentration  matrix.  Thus 
Gaussian  graphical  model  selection  is  the  problem  of  learning  a  model  described  by  a 
sparse  concentration  matrix  to  best  approximate  the  observed  statistics  in  a  collection  of 
random  variables  [119].  However  a  significant  difficulty  arises  if  we  do  not  have  sample 
observations  of  some  of  the  relevant  variables,  because  a  whole  set  of  extra  correlations 
are  induced  among  the  observed  variables  due  to  marginalization  over  the  unobserved, 
hidden  variables.  Is  it  possible  to  discover  the  number  of  hidden  components,  and  to 
learn  a  statistical  model  over  the  entire  collection  of  variables?  If  only  we  realized  that 
much  of  the  seemingly  complicated  correlation  structure  among  the  observed  variables 
can  be  explained  as  the  effect  of  marginalization  over  a  few  hidden  variables,  we  would 
be  able  to  learn  a  “simple”  statistical  model  among  the  observed  variables  and  a  few 
additional  hidden  variables. 

In  the  Gaussian  setting  this  problem  reduces  to  one  of  approximating  a  given  matrix 
by  the  sum  of  a  sparse  matrix  and  a  low-rank  matrix:  the  low-rank  matrix  corresponds 
to  the  correlations  induced  by  marginalization  over  latent  variables  (it  is  low-rank  as  the 
number of hidden variables is usually much smaller than the number of observed variables),
and the sparse matrix corresponds to the conditional graphical model structure
among  the  observed  variables  conditioned  on  the  latent  variables.  From  a  statistical 
viewpoint  this  approach  to  modeling  can  be  seen  as  a  blend  of  dimensionality  reduction 
(to  identify  latent  variables)  and  graphical  modeling  (to  capture  remaining  statistical 
structure  not  attributable  to  the  latent  variables).  In  Chapter  4,  we  propose  a  tractable 
convex  programming  estimator  for  latent  variable  Gaussian  graphical  model  selection 
based  on  regularized  maximum-likelihood;  motivated  by  the  results  in  Chapter  3  the 
regularizer uses the $\ell_1$ norm for the sparse component, and the nuclear norm for the
low-rank component. In addition to being computationally efficient to evaluate, this
estimator enjoys favorable statistical consistency properties. Indeed we show that consistent
model selection is possible under suitable identifiability conditions even if the
number  of  observed/latent  variables  is  on  the  same  order  as  the  number  of  samples  of 
the  observed  variables.  The  rank-sparsity  uncertainty  principles  of  Chapter  3  described 
above  are  fundamental  to  our  analysis.  Previous  approaches  to  latent  variable  graphical 
modeling  using  variants  of  the  Expectation-Maximization  (EM)  algorithm  do  not  share 
these  favorable  properties,  as  they  optimize  non-convex  functions  (hence  converging 
only to local optima) and have no high-dimensional consistency guarantees.

Convex  Optimization  for  Inverse  Problems 

Many  of  the  questions  from  the  previous  two  sections  can  be  viewed  as  instances  of 
inverse  problems  in  which  we  wish  to  recover  simple  and  structured  models  given  limited 
information.  In  Chapter  5  we  study  a  general  class  of  linear  inverse  problems  in  which 
the  goal  is  to  recover  a  model  given  a  small  number  of  linear  measurements.  Such 
problems  are  generally  ill-posed  as  the  number  of  measurements  available  is  typically 
smaller  than  the  dimension  of  the  model.  However  in  many  practical  applications 
of  interest,  models  are  often  constrained  structurally  so  that  they  only  have  a  few 
degrees  of  freedom  relative  to  their  ambient  dimension.  Exploiting  such  structure  is 
the  key  to  making  linear  inverse  problems  well-posed.  The  class  of  simple  models 
that  we  consider  in  Chapter  5  are  those  formed  as  the  sum  of  a  few  atoms  from  some 
elementary  atomic  set;  examples  include  well-studied  cases  such  as  sparse  vectors  (e.g., 
signal  processing,  statistics)  and  low-rank  matrices  (e.g.,  control,  statistics),  as  well 
as several others such as sums of a few permutation matrices (e.g., ranked elections,
multiobject  tracking),  low-rank  tensors  (e.g.,  vision,  neuroscience),  orthogonal  matrices 
(e.g.,  machine  learning),  and  atomic  measures  (e.g.,  system  identification).  We  describe 
a  general  framework  to  convert  such  notions  of  simplicity  into  convex  penalty  functions, 
which  give  rise  to  convex  optimization  solutions  to  linear  inverse  problems.  These 
convex  programs  can  be  solved  via  semidefinite  programming  under  suitable  conditions, 
and they significantly generalize previous approaches based on $\ell_1$ norm and nuclear
norm minimization for recovering sparse and low-rank models. Our results give general
conditions and bounds on the number of generic measurements under which exact or
robust  recovery  of  the  underlying  model  is  possible  via  convex  optimization.  Thus  this 
work  extends  the  catalog  of  simple  models  (beyond  sparse  vectors,  i.e.,  compressed 
sensing,  and  low-rank  matrices)  that  can  be  recovered  from  limited  linear  information 
via  tractable  convex  programming. 

Convex  Graph  Invariants 

Investigating  graphs  from  the  viewpoint  of  statistics  provides  a  very  fruitful  research 
agenda,  as  many  questions  from  classical  statistics  can  be  posed  in  a  deterministic 
setting  in  which  data  are  represented  as  graphs.  As  an  example  suppose  that  we  have 
a composite graph formed as the combination of two graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ overlaid on
the same set of nodes. We are only given the composite graph without any additional
information  about  the  relative  labeling  of  the  nodes,  which  may  reveal  the  structure  of 
the  individual  components.  Can  we  deconvolve  the  composite  graph  into  the  individual 
components?  As  discussed  in  Chapter  6  such  a  problem  is  of  interest  in  network  analysis 
in  social  and  biological  networks  in  which  one  seeks  to  decompose  a  complex  network 
into  simpler  components  to  better  understand  the  behavior  of  the  composite  network. 
Other  problems  motivated  by  statistics  include  hypothesis  testing  between  families  of 
graphs, and generating/sampling graphs with certain desired structural properties (see
Chapter  6  for  details). 

An  important  goal  towards  solving  these  and  many  other  graph  problems  is  to 
characterize  the  underlying  structural  properties  of  a  graph.  Graph  invariants  play  an 
important  role  in  describing  such  abstract  structural  features,  as  they  do  not  depend 
on  the  labeling  of  the  nodes  of  the  graph.  Examples  of  commonly  used  graph  invariants 
include  the  spectrum  of  a  graph  (i.e.,  eigenvalues  of  the  adjacency  matrix),  or  the  degree 
sequence.  In  Chapter  6  we  introduce  and  investigate  convex  graph  invariants,  which  are 
graph  invariants  that  are  convex  functions  of  the  adjacency  matrix  of  a  graph.  Examples 
of  such  functions  of  a  graph  include  the  maximum  degree,  the  MAXCUT  value  (and  its 
semidefinite  relaxation),  the  second  smallest  eigenvalue  of  the  Laplacian,  and  spectral 
invariants  such  as  the  sum  of  the  k  largest  eigenvalues  of  the  adjacency  matrix.  Convex 
graph  invariants  provide  a  systematic  and  unified  computational  framework  based  on 
convex  optimization  for  solving  a  number  of  interesting  graph  problems  such  as  those 
described  above. 
Chapter 2


Background 


In  this  chapter  we  emphasize  the  main  themes  common  to  the  rest  of  this  thesis.  Our 
exposition  is  brief  as  we  only  provide  the  basic  relevant  technical  background,  and  we 
refer  the  reader  to  the  texts  [124]  (on  convex  analysis)  and  [79]  (on  algebraic  geometry) 
for  more  details.  The  individual  chapters  also  give  more  background  pertaining  to  the 
corresponding  chapter. 

■  2.1  Basics  of  Convex  Analysis 

A set $C \subseteq \mathbb{R}^p$ is a convex set if for any $x, y \in C$ and any scalar $\lambda \in [0, 1]$, we have that
$\lambda x + (1 - \lambda) y \in C$. A convex set $C$ is also a cone if it is closed under positive linear
combinations.  Such  convex  cones  are  fundamental  objects  of  study  in  convex  analysis, 
and  play  an  important  role  in  all  the  main  chapters  of  this  thesis. 

The polar $C^*$ of a cone $C$ is the cone

\[ C^* = \{ x \in \mathbb{R}^p : \langle x, z \rangle \leq 0 \ \ \forall z \in C \}. \]

Given a closed convex set $C \subseteq \mathbb{R}^p$ and some nonzero $x \in \mathbb{R}^p$ we define the tangent cone
at $x$ with respect to $C$ as

\[ T_C(x) = \mathrm{cone}\{ z - x : z \in C \}. \tag{2.1} \]

Here $\mathrm{cone}(\cdot)$ refers to the conic hull of a set obtained by taking nonnegative linear
combinations of elements of the set. The cone $T_C(x)$ is the set of directions to points
in $C$ from the point $x$. The normal cone $N_C(x)$ at $x$ with respect to the convex set $C$
is defined to be the polar cone of the tangent cone $T_C(x)$, i.e., the normal cone consists
of vectors that form an obtuse angle with every vector in the tangent cone $T_C(x)$.

A real-valued function $f$ defined on a convex set $C$ is said to be a convex function
if for any $x, y \in C$ and any scalar $\lambda \in [0, 1]$, we have that

\[ f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y). \]

Following standard notation in convex analysis, we denote the subdifferential of a convex
function $f$ at a point $x$ in its domain by $\partial f(x)$. The subdifferential $\partial f(x)$ consists of
all $y$ such that

\[ f(\tilde{x}) \geq f(x) + \langle y, \tilde{x} - x \rangle, \quad \forall \tilde{x}. \]
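
As a small numerical illustration of the subgradient inequality (a sketch that is not part of the original development), consider the absolute value function on the real line. At the origin every slope in $[-1, 1]$ is a subgradient, while at a point where the function is differentiable only the derivative qualifies. The helper name `is_subgradient` and the sampling grid are illustrative choices, and checking finitely many points is only a necessary test.

```python
def is_subgradient(f, x, y, test_points, tol=1e-12):
    """Check f(x_tilde) >= f(x) + y * (x_tilde - x) on a finite set of points.

    Passing this check on samples is necessary, not sufficient, for y to
    belong to the subdifferential of f at x.
    """
    return all(f(xt) >= f(x) + y * (xt - x) - tol for xt in test_points)


grid = [i / 10.0 for i in range(-50, 51)]  # sample points in [-5, 5]

# At x = 0 every slope in [-1, 1] satisfies the inequality for f = abs.
at_zero_ok = all(is_subgradient(abs, 0.0, y, grid) for y in (-1.0, -0.3, 0.0, 0.7, 1.0))
# A slope outside [-1, 1] is violated at some sampled point.
at_zero_bad = is_subgradient(abs, 0.0, 1.5, grid)
# At x = 2 the function is differentiable; slope 1 passes, slope 0.5 fails.
at_two_ok = is_subgradient(abs, 2.0, 1.0, grid)
at_two_bad = is_subgradient(abs, 2.0, 0.5, grid)
```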

■  2.2  Representation  of  Convex  Sets 

Convex  programs  denote  those  optimization  problems  in  which  we  seek  to  minimize 
a  convex  function  over  a  convex  constraint  set  [24].  For  example  linear  programming 
and  semidefinite  programming  form  two  prominent  subclasses  in  which  linear  functions 
are  minimized  over  constraint  sets  given  by  affine  spaces  intersecting  the  nonnegative 
orthant  (in  linear  programming)  and  the  positive-semidefinite  cone  (in  semidefinite 
programming) [11]. Roughly speaking, convex programs are tractable to solve computationally
if the convex objective function can be computed efficiently, and membership
in the convex constraint sets can be certified efficiently. Hence, the tractable representation
of convex sets is an important point that must be addressed in order to develop
practically  feasible  computational  solutions  to  convex  optimization  problems. 

Any closed convex set has two dual representations. Specifically, an element $x$ belonging
to a convex set $C$ is an extreme point if it cannot be expressed as the midpoint
of the line segment between two distinct points in $C$. With this definition the first representation
of a convex set is as the convex hull of all its extreme points. With respect to
this  representation,  certifying  membership  in  a  convex  set  means  that  we  must  produce 
a  representation  of  a  point  as  the  convex  combination  of  (a  subset  of)  extreme  points.  A 
second  representation  of  a  convex  set  is  as  the  intersection  of  (possibly  infinitely  many) 
halfspaces.  Here  certifying  membership  of  a  point  in  a  convex  set  means  that  we  need 
to  verify  that  this  point  satisfies  the  constraints  defining  the  convex  set.  Using  the  tools 
of  convex  duality  one  can  transform  between  these  two  alternate  representations  of  a 
convex  set  (see  [124]  for  more  details). 

In  this  section  we  provide  several  examples  of  convex  sets  and  their  representations, 
with  the  objective  of  highlighting  the  main  ideas  that  lead  to  tractable  representations. 
In  particular  the  concept  of  lift-and-project  plays  a  central  role  in  many  examples  of 
efficient representations of convex sets. The lift-and-project concept is simple: we wish
to express a convex set $C \subseteq \mathbb{R}^p$ as the projection of a convex set $C' \subseteq \mathbb{R}^{p'}$ in some
higher-dimensional space (i.e., $p' > p$). Such methods are useful if $p'$ is not too much
larger than $p$ and if $C'$ has an efficient representation in the higher-dimensional space
$\mathbb{R}^{p'}$. Lift-and-project provides a very powerful representation tool, as will be seen in
the examples to follow.

Figure 2.1. The cross-polytope in two dimensions.

■  2.2.1  Cross-polytope 

The cross-polytope (see Figure 2.1) is the unit ball of the $\ell_1$-norm:

\[ B^p_{\ell_1} = \Big\{ x \in \mathbb{R}^p \;\Big|\; \sum_i |x_i| \leq 1 \Big\}. \]

The $\ell_1$-norm has been the focus of much attention recently due to its sparsity-inducing
properties [29, 53, 54].

In  a  statistical  model  selection  setting  sparsity  corresponds  to  models  that  consist  of 
few  nonzero  parameters.  Specialized  to  a  linear  regression  or  feature  selection  context, 
penalty functions based on the $\ell_1$-norm lead to parameter vectors that are sparse,
i.e., responses are expressed as the linear combination of a small number of features
[135]. Specialized to a covariance selection context, $\ell_1$-norm penalty functions lead
to  distributions  defined  by  sparse  covariance  and  concentration  matrices  [15,16,61]. 
Sparsity  has  also  played  a  central  role  in  signal  processing  as  a  variety  of  applications 
exploit  the  expression  of  signals  as  the  sum  of  few  elements  from  a  dictionary,  e.g., 
approximating natural images as the weighted sum of a few wavelet basis functions.
The  benefits  of  such  sparse  approximations  are  clear  for  tasks  such  as  compression,  but 
extend  also  to  tasks  such  as  signal  denoising  and  classification. 

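How do we represent the $p$-dimensional cross-polytope? While the cross-polytope
has $2p$ vertices, a direct specification in terms of halfspaces involves $2^p$ inequalities:

\[ B^p_{\ell_1} = \Big\{ x \in \mathbb{R}^p \;\Big|\; \sum_i z_i x_i \leq 1, \ \forall z \in \{-1, +1\}^p \Big\}. \]

However we can obtain a tractable inequality representation by lifting to $\mathbb{R}^{2p}$ and then
projecting onto the first $p$ coordinates:

\[ B^p_{\ell_1} = \Big\{ x \in \mathbb{R}^p \;\Big|\; \exists z \in \mathbb{R}^p \ \text{s.t.} \ -z_i \leq x_i \leq z_i \ \forall i, \ \sum_i z_i \leq 1, \ z_i \geq 0 \ \forall i \Big\}. \]

Note that in $\mathbb{R}^{2p}$ with the additional variables $z$, we have only $3p + 1$ inequalities.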
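
The following sketch contrasts the two descriptions of the cross-polytope on small examples. It is an illustration under our own naming (the function names are hypothetical, not from the thesis): the direct test enumerates all sign vectors, while the lifted test exhibits the witness $z_i = |x_i|$, which is the componentwise smallest $z$ satisfying the bound constraints.

```python
from itertools import product


def in_cross_polytope_direct(x, tol=1e-12):
    """Halfspace description: <s, x> <= 1 for every sign vector s (2^p inequalities)."""
    return all(
        sum(si * xi for si, xi in zip(s, x)) <= 1 + tol
        for s in product((-1, 1), repeat=len(x))
    )


def in_cross_polytope_lifted(x, tol=1e-12):
    """Lifted description: find z with -z_i <= x_i <= z_i, z_i >= 0, sum(z) <= 1.

    The witness z_i = |x_i| is the smallest feasible choice, so the lifted
    system is feasible exactly when sum_i |x_i| <= 1.
    """
    z = [abs(xi) for xi in x]
    bounds_ok = all(-zi <= xi <= zi and zi >= 0 for xi, zi in zip(x, z))
    return bounds_ok and sum(z) <= 1 + tol


def inequality_counts(p):
    """Number of inequalities in the direct and in the lifted description."""
    return 2 ** p, 3 * p + 1
```

For instance, `inequality_counts(20)` compares roughly a million halfspaces in the direct description against 61 inequalities in the lifted one.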

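Next suppose $x \in \mathbb{R}^p$ is a point on the boundary of the cross-polytope, i.e., $\|x\|_{\ell_1} = 1$.
Letting $\Omega \subseteq \{1, \ldots, p\}$ denote the indices at which $x$ is nonzero, the normal cone at
$x$ with respect to $B^p_{\ell_1}$ is given as:

\[ N_{B^p_{\ell_1}}(x) = \{ z \mid z_i = t \, \mathrm{sgn}(x_i) \ \text{for} \ i \in \Omega, \ |z_i| \leq t \ \text{for} \ i \in \Omega^c, \ \text{for some} \ t \geq 0 \}. \]

Here $\mathrm{sgn}(\cdot)$ is the sign function.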
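
The closed-form description of the normal cone can be cross-checked against the definition of the normal cone as the polar of the tangent cone. Because the cross-polytope is the convex hull of the vertices $\pm e_i$, it suffices to test the inequality $\langle z, v - x \rangle \leq 0$ at those vertices. The sketch below does this for a boundary point $x$; the function names are illustrative assumptions.

```python
def sgn(v):
    """Sign function returning -1, 0, or 1."""
    return (v > 0) - (v < 0)


def in_normal_cone_formula(z, x, tol=1e-12):
    """Closed-form test: z_i = t*sgn(x_i) on the support of x, |z_i| <= t elsewhere."""
    support = [i for i, xi in enumerate(x) if xi != 0]
    t = abs(z[support[0]])  # x lies on the boundary, so the support is nonempty
    on_support = all(abs(z[i] - t * sgn(x[i])) <= tol for i in support)
    off_support = all(abs(z[i]) <= t + tol for i in range(len(x)) if x[i] == 0)
    return on_support and off_support


def in_normal_cone_vertices(z, x, tol=1e-12):
    """Definition-based test: <z, v - x> <= 0 for every vertex v = +e_i or -e_i."""
    zx = sum(zi * xi for zi, xi in zip(z, x))
    return all(sign * zi - zx <= tol for zi in z for sign in (1, -1))


x = [0.5, -0.5, 0.0]  # boundary point: the absolute values sum to one
candidates = [[2, -2, 1], [2, -2, 3], [2, -1, 0], [0, 0, 0]]
agreement = [
    in_normal_cone_formula(z, x) == in_normal_cone_vertices(z, x) for z in candidates
]
```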

■  2.2.2  Nuclear-norm  ball 

The  nuclear  norm  of  a  matrix  (see  Figure  2.2  for  the  unit  ball)  is  the  sum  of  its  singular 
values: 

\[ \|X\|_* = \sum_i \sigma_i(X). \]

Analogous to the case of the $\ell_1$-norm, the nuclear norm has received much attention
recently  because  it  induces  low-rank  structure  in  matrices  in  a  number  of  settings  [30, 
121]. 
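
For intuition, the nuclear norm can be evaluated in closed form for $2 \times 2$ symmetric matrices, the setting of Figure 2.2. The sketch below assumes the parametrization with diagonal entries `x`, `y` and off-diagonal entry `s`; it uses the fact that the singular values of a symmetric matrix are the absolute values of its eigenvalues. The function name is an illustrative choice.

```python
import math


def nuclear_norm_sym2(x, y, s):
    """Nuclear norm of the symmetric matrix [[x, s], [s, y]].

    The eigenvalues are mean +/- radius, and the singular values of a
    symmetric matrix are the absolute values of its eigenvalues.
    """
    mean = (x + y) / 2.0
    radius = math.hypot((x - y) / 2.0, s)
    return abs(mean + radius) + abs(mean - radius)
```

Points such as `(0.5, 0.5, 0.0)`, `(1.0, 0.0, 0.0)`, and `(0.0, 0.0, 0.5)` all have nuclear norm one and therefore lie on the boundary of the unit ball.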

In  a  statistics  context  low-rank  covariance  matrices  are  used  in  factor  analysis,  and 
they  represent  the  property  that  the  corresponding  random  variables  lie  on  or  near  a 
low-dimensional  subspace.  In  a  control  setting  low-rank  system  matrices  correspond  to 
systems  with  a  low-dimensional  state  space,  i.e.,  systems  with  small  model  order.  In 
optical  system  modeling  low-rank  matrices  represent  so-called  coherent  systems,  which 
correspond  to  low-pass  optical  filters. 
Figure 2.2. The nuclear-norm ball of $2 \times 2$ symmetric matrices. Here $x$, $y$ denote the diagonal entries,
and $s$ the off-diagonal entry.


Unlike with the $\ell_1$-norm the nuclear-norm of a matrix has no closed-form representation,
but can instead be expressed variationally. Specifically, the spectral or operator
norm $\|\cdot\|$ of a matrix (the largest singular value) is the dual norm of the nuclear-norm
$\|\cdot\|_*$ [82]:

\[ \|X\|_* = \max \{ \mathrm{Tr}(X'Y) \mid \|Y\| \leq 1 \}. \]

Further, the spectral norm admits a simple semidefinite characterization:

\[ \|Y\| = \min_t \; t \quad \text{s.t.} \quad \begin{pmatrix} t I_p & Y \\ Y' & t I_p \end{pmatrix} \succeq 0. \]

We then obtain the following SDP characterization of the nuclear-norm:

\[ \|X\|_* = \min_{W_1, W_2} \; \tfrac{1}{2} \big( \mathrm{trace}(W_1) + \mathrm{trace}(W_2) \big) \quad \text{s.t.} \quad \begin{pmatrix} W_1 & X \\ X' & W_2 \end{pmatrix} \succeq 0. \]

This semidefinite characterization can in turn be used to specify the unit ball of the
nuclear-norm:

\[ B^{p \times p}_* = \{ X \in \mathbb{R}^{p \times p} \mid \|X\|_* \leq 1 \}. \]

Suppose $X \in \mathbb{R}^{p \times p}$ is a boundary point of the nuclear-norm ball, i.e., $\|X\|_* = 1$.
Let $X = U \Sigma V'$ be a singular value decomposition of $X$, such that $U, V \in \mathbb{R}^{p \times \mathrm{rank}(X)}$
and $\Sigma \in \mathbb{R}^{\mathrm{rank}(X) \times \mathrm{rank}(X)}$. Further let $\mathcal{W} \subseteq \mathbb{R}^{p \times p}$ denote the subspace of matrices given
by the span of matrices with either the same row space or the same column space as
$X$:

\[ \mathcal{W} = \{ UM' + NV' \mid M, N \in \mathbb{R}^{p \times \mathrm{rank}(X)} \}. \]

Then we have the following description of the normal cone at $X$ with respect to $B^{p \times p}_*$:

\[ N_{B^{p \times p}_*}(X) = \{ t U V' + W \in \mathbb{R}^{p \times p} \mid W'U = 0, \ WV = 0, \ \|W\| \leq t, \ t \geq 0 \}. \]

Here $\mathcal{P}$ denotes the projection operator. Notice the parallels with the normal cone with
respect to the cross-polytope.

Figure 2.3. The permutahedron generated by the vector $[1, 2, 3, 4]'$.


■  2.2.3  Permutahedron 

The permutahedron (see Figure 2.3) generated by a vector $x \in \mathbb{R}^p$ is the convex hull of
all permutations of the vector $x$:

\[ P_p(x) = \mathrm{conv}\{ \Pi x \mid \Pi \ \text{a permutation matrix} \}. \]

The set of permutations of the vector $[1, \ldots, p]'$ represents the set of all rankings of
$p$ objects. Consequently the permutahedron, and the related Birkhoff polytope (the
convex  hull  of  permutation  matrices),  lead  to  useful  convex  relaxation  approaches  in 
ranking  and  tracking  problems  (see  Chapter  5). 
The permutahedron $P_p(x)$ of a vector composed of distinct entries consists of $p!$
extreme points and a direct halfspace representation requires $2^p - 2$ inequalities (one
for each nonempty proper subset of $\{1, \ldots, p\}$). However the permutahedron still has a tractable
representation via lifting. Before describing this lifted specification, we require some
notation. For any vector $y$ let $\bar{y}$ denote the vector obtained by sorting the entries of $y$
in descending order. A vector $y \in \mathbb{R}^p$ is said to be majorized by a vector $x \in \mathbb{R}^p$ if the
following conditions hold:

\[ \sum_{i=1}^k \bar{y}_i \leq \sum_{i=1}^k \bar{x}_i, \ \ \forall k = 1, \ldots, p-1, \quad \text{and} \quad \sum_{i=1}^p \bar{y}_i = \sum_{i=1}^p \bar{x}_i. \tag{2.2} \]

The majorization principle states that the permutahedron $P_p(x)$ is exactly the set of
vectors majorized by $x$ [11]:

\[ P_p(x) = \{ y \in \mathbb{R}^p \mid y \ \text{majorized by} \ x \}. \]
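
The majorization conditions are straightforward to check numerically. The sketch below (with a hypothetical helper `is_majorized`) sorts both vectors in descending order, compares partial sums, and requires equal totals. As a sanity check it uses the centroid of all permutations of $[1, 2, 3, 4]'$, which is a convex combination of the vertices and should therefore be majorized by that vector.

```python
from itertools import permutations


def is_majorized(y, x, tol=1e-9):
    """Return True if y is majorized by x: partial sums of sorted entries, equal totals."""
    if len(y) != len(x):
        return False
    ys = sorted(y, reverse=True)
    xs = sorted(x, reverse=True)
    partial_y = partial_x = 0.0
    for k in range(len(ys) - 1):
        partial_y += ys[k]
        partial_x += xs[k]
        if partial_y > partial_x + tol:
            return False
    return abs(sum(ys) - sum(xs)) <= tol


x = [1, 2, 3, 4]
perms = list(permutations(x))
# The centroid of all permutations is a convex combination of them.
centroid = [sum(p[i] for p in perms) / len(perms) for i in range(len(x))]
```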

Consequently a tractable description of the permutahedron can be obtained if the majorization
inequalities of (2.2) can be expressed tractably. Since $\sum_{i=1}^k \bar{x}_i$ (the sum of the $k$ largest
entries of $x$) is a fixed quantity, we require a tractable expression for sets of the form

\[ Q_k(c) = \Big\{ y \in \mathbb{R}^p \;\Big|\; \sum_{i=1}^k \bar{y}_i \leq c \Big\}. \]

Letting $e \in \mathbb{R}^p$ denote the all-ones vector, we have that [11]

\[ Q_k(c) = \{ y \in \mathbb{R}^p \mid \exists z \in \mathbb{R}^p, \ s \in \mathbb{R} \ \text{s.t.} \ c - ks - e'z \geq 0, \ z \geq 0, \ z - y + se \geq 0 \}. \]

Here the last two inequalities are to be interpreted elementwise. Consequently we have
a tractable description of the permutahedron by lifting to $\mathbb{R}^{p^2+p-1}$ and using $2p^2 - 2p - 1$
inequalities and one equation.
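
To see why the lifted system captures the sum of the $k$ largest entries, note that any feasible pair satisfies $z_i \geq \max(y_i - s, 0)$, so $ks + e'z$ is at least the sum of the $k$ largest entries of $y$; choosing $s$ as the $k$-th largest entry and $z_i = \max(y_i - s, 0)$ attains this bound. The sketch below builds that witness; the function names are illustrative assumptions.

```python
def sum_top_k(y, k):
    """Sum of the k largest entries of y."""
    return sum(sorted(y, reverse=True)[:k])


def lifted_witness(y, k):
    """Return a feasible (s, z) for which k*s + sum(z) equals the top-k sum of y.

    Here s is the k-th largest entry and z_i = max(y_i - s, 0).
    """
    s = sorted(y, reverse=True)[k - 1]
    z = [max(yi - s, 0.0) for yi in y]
    return s, z


def in_Qk_lifted(y, k, c, tol=1e-9):
    """Certify membership in Q_k(c) through the lifted inequalities."""
    s, z = lifted_witness(y, k)
    feasible = all(zi >= 0 and zi - yi + s >= -tol for yi, zi in zip(y, z))
    return feasible and c - k * s - sum(z) >= -tol
```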

It turns out that a more efficient representation of the permutahedron can be specified
by lifting to a space of dimension $O(p \log(p))$ and using only $O(p \log(p))$ inequalities
[71]. This representation is based on the structure of certain sorting networks,
and  is  in  some  sense  the  most  efficient  possible  representation  of  the  permutahedron 
(see  [71]  for  more  details). 

■  2.2.4  Schur-Horn  orbitope 

Let $\mathbb{S}^p$ denote the space of $p \times p$ symmetric matrices, and let $\lambda(N)$ denote the sorted
(in descending order) eigenvalues of a symmetric matrix $N$. Given a symmetric matrix
$M \in \mathbb{S}^p$ the Schur-Horn orbitope specified by $M$ is defined as the convex hull of all
matrices with the same spectrum as that of $M$:

\[ SH_p(M) = \mathrm{conv}\{ UMU' \mid U \in \mathbb{R}^{p \times p} \ \text{orthogonal} \}. \]

The Schur-Horn orbitope is the spectral analog of the permutahedron, and the projection
of $SH_p(M)$ onto the set of diagonal matrices is exactly the permutahedron
$P_p(\lambda(M))$.

A spectral majorization principle can be used to give a tractable representation of
the Schur-Horn orbitope [11]. Specifically we have that

\[ SH_p(M) = \{ N \in \mathbb{S}^p \mid \lambda(N) \ \text{majorized by} \ \lambda(M) \}. \]

Again we have the following tractable representation of sets constraining the sum of
the top $k$ eigenvalues of a matrix [11]:

\[ R_k(c) = \Big\{ N \in \mathbb{S}^p \;\Big|\; \sum_{i=1}^k \lambda_i(N) \leq c \Big\}
= \{ N \in \mathbb{S}^p \mid \exists Z \succeq 0, \ s \in \mathbb{R} \ \text{s.t.} \ c - ks - \mathrm{Tr}(Z) \geq 0, \ Z - N + s I_p \succeq 0 \}. \]

Here $I_p$ represents the $p \times p$ identity matrix.

■  2.3  Semidefinite  Relaxations  using  Theta  Bodies 

In  many  cases  of  interest  convex  sets  may  not  be  tractable  to  represent,  and  it  is  of 
interest  to  develop  tractable  approximations.  Here  we  describe  a  method  to  obtain  a 
hierarchy  of  (increasingly  complex)  representations  for  convex  sets  given  as  the  convex 
hulls  of  sets  with  algebraic  structure.  Specifically  we  focus  on  the  setting  in  which  our 
convex  bodies  arise  as  the  convex  hulls  of  algebraic  varieties,  which  play  a  prominent 
role  in  this  thesis.  A  real  algebraic  variety  A  C  MP  is  the  set  of  real  solutions  of  a  system 
of  polynomial  equations: 

•A  =  {x  :  gj(x)  =  0,  Vj}, 

where  {gj}  is  a  finite  collection  of  polynomials  in  p  variables. 

A  basic  question  is  to  derive  tractable  representations  of  the  convex  hull  conv(H)  of 
a  variety  A.  All  our  discussion  here  is  based  on  results  described  in  [77]  for  semidefinite 
relaxations  of  convex  hulls  of  algebraic  varieties  using  theta  bodies.  We  only  give  a 
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brief  review  of  the  relevant  constructions,  and  refer  the  reader  to  the  vast  literature  on 
this  subject  for  more  details  (see  [77,114]  and  the  references  therein). 

To  begin  with  we  note  that  a  sum- of -squares  (SOS)  polynomial  in  M[x]  (the  ring 
of  polynomials  in  the  variables  xi, . . .  ,xp)  is  a  polynomial  that  can  be  written  as  the 
(finite)  sum  of  squares  of  other  polynomials  in  M[x].  Verifying  the  nonnegativity  of  a 
multivariate  polynomial  is  intractable  in  general,  and  therefore  SOS  polynomials  play 
an  important  role  in  real  algebraic  geometry  as  an  SOS  polynomial  is  easily  seen  to  be 
nonnegative  everywhere.  Further  checking  whether  a  polynomial  is  an  SOS  polynomial 
can  be  accomplished  efficiently  via  semidefinite  programming  [114]. 

Turning  our  attention  to  the  description  of  the  convex  hull  of  an  algebraic  variety, 
we  will  assume  for  the  sake  of  simplicity  that  the  convex  hull  is  closed.  Let  I  C  M[x] 
be  a  polynomial  ideal  [79],  and  let  Vr(7)  G  Mp  be  its  real  algebraic  variety: 

Vr{I)  =  {x  :  /(x)  =  0,  V/  G  /}. 

One  can  then  show  that  the  convex  hull  conv(VK(/))  is  given  as: 

conv(hR(/))  =  {x  :  /(x)  >  0,  V/  linear  and  nonnegative  on  Vr(7)} 

=  {x:/(x)  >  0,  V/  linear  s.t.  f  =  h  +  g,  V  h  nonnegative,  V  g  G  7} 
=  {x  :  /(x)  >  0,  V/  linear  s.t.  /  nonnegative  modulo  I}. 

A  linear  polynomial  here  is  one  that  has  a  maximum  degree  of  one,  and  the  meaning 
of  “modulo  an  ideal”  is  clear.  As  nonnegativity  modulo  an  ideal  may  be  intractable  to 
check,  we  can  consider  a  relaxation  to  a  polynomial  being  SOS  modulo  an  ideal,  i.e.,  a 
polynomial  that  can  be  written  as  Yll-i  +  9  f°r  9  in  the  ideal.  Since  it  is  tractable 
to  check  via  semidefinite  programmming  whether  bounded-degree  polynomials  are  SOS, 
the  fc-th  theta  body  of  an  ideal  7  is  defined  as  follows  in  [77]: 

TH^(J)  =  {x  :  /(x)  >  0,  V/  linear  s.t.  /  is  k- sos  modulo  /}. 

Here  k- sos  refers  to  an  SOS  polynomial  in  which  the  components  in  the  SOS  decom¬ 
position  have  degree  at  most  k.  The  k-th.  theta  body  TH^(/)  is  a  convex  relaxation  of 
conv(VEj(/)),  and  one  can  verify  that 

conv(Vfe(/))  C  •  •  •  C  TH*+1(J)  C  TH k(VR(I)). 

By  the  arguments  given  above  (see  also  [77])  these  theta  bodies  can  be  described  us¬ 
ing  semidefinite  programs  of  size  polynomial  in  k.  Hence  by  considering  theta  bodies 
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THfc(I)  with  increasingly  larger  k,  one  can  obtain  a  hierarchy  of  tighter  semidefinite 
relaxations  of  conv(hffi(/)).  We  also  note  that  in  many  cases  of  interest  such  semidefi¬ 
nite  relaxations  preserve  low-dinrensional  faces  of  the  convex  hull  of  a  variety,  although 
these  properties  are  not  known  in  general. 

Example  The  cut  polytope  is  defined  as  the  convex  hull  of  all  symmetric  rank-one 
signed  matrices: 

CPP  =  convjzz3  :  z  €  {  — 1,+1}P}. 

It  is  well-known  that  the  cut  polytope  is  intractable  to  characterize  [47],  and  therefore 
we  need  to  use  tractable  relaxations  instead.  The  following  popular  relaxation  is  used 
in  semidefinite  approximations  of  the  MAXCUT  problem: 

CP  —  SDPp  =  {M  :  M  symmetric,  M  P  0,  Mu  =  1,  Vi  =  1,  •  •  •  ,p}. 

This  is  the  well-studied  elliptope  [47],  and  can  be  interpreted  as  the  second  theta  body 
relaxation  of  the  cut  polytope  CPP  [77]. 


Chapter  3 


Rank-Sparsity  Uncertainty  Principles 

and  Matrix  Decomposition 


■  3.1  Introduction 

Complex  systems  and  models  arise  in  a  variety  of  problems  in  science  and  engineering. 
In  many  applications  such  complex  systems  and  models  are  often  composed  of  multiple 
simpler  systems  and  models.  Therefore,  in  order  to  better  understand  the  behavior  and 
properties  of  a  complex  system  a  natural  approach  is  to  decompose  the  system  into 
its  simpler  components.  In  this  chapter  we  consider  matrix  representations  of  systems 
and  statistical  models  in  which  our  matrices  are  formed  by  adding  together  sparse 
and  low-rank  matrices.  We  study  the  problem  of  recovering  the  sparse  and  low-rank 
components  given  no  prior  knowledge  about  the  sparsity  pattern  of  the  sparse  matrix, 
or  the  rank  of  the  low-rank  matrix.  We  propose  a  tractable  convex  program  to  recover 
these  components,  and  provide  sufficient  conditions  under  which  our  procedure  recovers 
the  sparse  and  low-rank  matrices  exactly. 

Such  a  decomposition  problem  arises  in  a  number  of  settings,  with  the  sparse  and 
low-rank  matrices  having  different  interpretations  depending  on  the  application.  In 
a  statistical  model  selection  setting,  the  sparse  matrix  can  correspond  to  a  Gaussian 
graphical  model  [93]  and  the  low-rank  matrix  can  summarize  the  effect  of  latent,  un¬ 
observed  variables  (see  Chapter  4  for  a  detailed  investigation).  In  computational  com¬ 
plexity,  the  notion  of  matrix  rigidity  [138]  captures  the  smallest  number  of  entries  of  a 
matrix  that  must  be  changed  in  order  to  reduce  the  rank  of  the  matrix  below  a  specified 
level  (the  changes  can  be  of  arbitrary  magnitude).  Bounds  on  the  rigidity  of  a  matrix 
have  several  implications  in  complexity  theory  [99].  Similarly,  in  a  system  identifica¬ 
tion  setting  the  low-rank  matrix  represents  a  system  with  a  small  model  order  while 
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the  sparse  matrix  represents  a  system  with  a  sparse  impulse  response.  Decomposing  a 
system  into  such  simpler  components  can  be  used  to  provide  a  simpler,  more  efficient 
description. 

■  3.1.1  Our  results 

Formally  the  decomposition  problem  in  which  we  are  interested  can  be  defined  as  fol¬ 
lows: 

Problem  Given  C  =  A*  +  B *  where  A *  is  an  unknown  sparse  matrix  and  B*  is  an 
unknown  low-rank  matrix,  recover  A *  and  B *  from  C  using  no  additional  information 
on  the  sparsity  pattern  and/or  the  rank  of  the  components. 

In  the  absence  of  any  further  assumptions,  this  decomposition  problem  is  fundamen¬ 
tally  ill-posed.  Indeed,  there  are  a  number  of  scenarios  in  which  a  unique  splitting  of 
C  into  “low-rank”  and  “sparse”  parts  may  not  exist;  for  example,  the  low-rank  matrix 
may  itself  be  very  sparse  leading  to  identihability  issues.  In  order  to  characterize  when 
a  unique  decomposition  is  possible  we  develop  a  notion  of  rank-sparsity  incoherence , 
an  uncertainty  principle  between  the  sparsity  pattern  of  a  matrix  and  its  row/column 
spaces.  This  condition  is  based  on  quantities  involving  the  tangent  spaces  to  the  al¬ 
gebraic  variety  of  sparse  matrices  and  the  algebraic  variety  of  low-rank  matrices  [79]. 
Another  point  of  ambiguity  in  the  problem  statement  is  that  one  could  subtract  a 
nonzero  entry  from  A*  and  add  it  to  B *;  the  sparsity  level  of  A *  is  strictly  improved 
while  the  rank  of  B *  is  increased  by  at  most  1.  Therefore  it  is  in  general  unclear  what 
the  “true”  sparse  and  low-rank  components  are.  We  discuss  this  point  in  greater  detail 
in  Section  3.4.2  following  the  statement  of  the  main  theorem.  In  particular  we  describe 
how  our  identihability  and  recovery  results  for  the  decomposition  problem  are  to  be 
interpreted. 

Two  natural  identihability  problems  may  arise.  The  hrst  one  occurs  if  the  low- 
rank  matrix  itself  is  very  sparse.  In  order  to  avoid  such  a  problem  we  impose  certain 
conditions  on  the  row/column  spaces  of  the  low-rank  matrix.  Specifically,  for  a  matrix 
M  let  T(M )  be  the  tangent  space  at  M  with  respect  to  the  variety  of  all  matrices  with 
rank  less  than  or  equal  to  rank(Af).  Operationally,  T(Af)  is  the  span  of  all  matrices 
with  row-space  contained  in  the  row-space  of  M  or  with  column-space  contained  in  the 
column-space  of  Af;  see  (3.7)  for  a  formal  characterization.  Let  £(Af)  be  defined  as 
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follows: 


£(M)  =  max 

NeT(M),  || JV||  <1 


Halloo 


(3.1) 


Here  ||  •  ||  is  the  spectral  norm  (i.e.,  the  largest  singular  value),  and  ||  •  ||oo  denotes  the 
largest  entry  in  magnitude.  Thus  £(M)  being  small  implies  that  (appropriately  scaled) 
elements  of  the  tangent  space  T(M )  are  “diffuse” ,  i.e.,  these  elements  are  not  too  sparse; 
as  a  result  M  cannot  be  very  sparse.  As  shown  in  Proposition  3.4.3  (see  Section  3.4.3) 
a  low-rank  matrix  M  with  row/column  spaces  that  are  not  closely  aligned  with  the 
coordinate  axes  has  small  £(M). 

The  other  identihability  problem  may  arise  if  the  sparse  matrix  has  all  its  support 
concentrated  in  one  column;  the  entries  in  this  column  could  negate  the  entries  of  the 


corresponding  low-rank  matrix,  thus  leaving  the  rank  and  the  column  space  of  the 
low-rank  matrix  unchanged.  To  avoid  such  a  situation,  we  impose  conditions  on  the 


sparsity  pattern  of  the  sparse  matrix  so  that  its  support  is  not  too  concentrated  in  any 
row/column.  For  a  matrix  M  let  Q(M)  be  the  tangent  space  at  M  with  respect  to  the 
variety  of  all  matrices  with  number  of  nonzero  entries  less  than  or  equal  to  |support(M)|. 
The  space  f i(M)  is  simply  the  set  of  all  matrices  that  have  support  contained  within 
the  support  of  M;  see  (3.5).  Let  / )  be  defined  as  follows: 


MM)  ^ 


max  II  Al| . 

Nen(M),  ||jv||oo<i 


(3.2) 


The  quantity  fi(M)  being  small  for  a  matrix  implies  that  the  spectrum  of  any  element 
of  the  tangent  space  Q(M)  is  “diffuse”,  i.e.,  the  singular  values  of  these  elements  are 
not  too  large.  We  show  in  Proposition  3.4.2  (see  Section  3.4.3)  that  a  sparse  matrix  M 
with  “bounded  degree”  (a  small  number  of  nonzeros  per  row/column)  has  small  fJb(M). 

For  a  given  matrix  M,  it  is  impossible  for  both  quantities  £(M)  and  n(M)  to  be 
simultaneously  small.  Indeed,  we  prove  that  for  any  matrix  M  ^  0  we  must  have  that 
£( M)n(M )  >  1  (see  Theorem  3.3.1  in  Section  3.3.3).  Thus,  this  uncertainty  principle 
asserts  that  there  is  no  nonzero  matrix  M  with  all  elements  in  T(M )  being  diffuse  and 
all  elements  in  fi(M)  having  diffuse  spectra.  As  we  describe  later,  the  quantities  £  and  /r 
are  also  used  to  characterize  fundamental  identihability  in  the  decomposition  problem. 

In  general  solving  the  decomposition  problem  is  intractable;  this  is  due  to  the  fact 
that  it  is  intractable  in  general  to  compute  the  rigidity  of  a  matrix  (see  Section  3.2.2), 
which  can  be  viewed  as  a  special  case  of  the  sparse-plus-low-rank  decomposition  prob¬ 
lem.  Hence,  we  consider  tractable  approaches  employing  recently  well-studied  convex 
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relaxations.  We  formulate  a  convex  optimization  problem  for  decomposition  using  a 
combination  of  the  i\  norm  and  the  nuclear  norm.  For  any  matrix  M  the  i\  norm  is 
given  by 

||M||1  =  ^|My|, 
hj 

and  the  nuclear  norm,  which  is  the  sum  of  the  singular  values,  is  given  by 

k 

where  {oy>;(M)}  are  the  singular  values  of  M.  The  £\  norm  has  been  used  as  an  effective 
surrogate  for  the  number  of  nonzero  entries  of  a  vector,  and  a  number  of  results  pro¬ 
vide  conditions  under  which  this  heuristic  recovers  sparse  solutions  to  ill-posed  inverse 
problems  [29,53,54],  More  recently,  the  nuclear  norm  has  been  shown  to  be  an  effec¬ 
tive  surrogate  for  the  rank  of  a  matrix  [64],  This  relaxation  is  a  generalization  of  the 
previously  studied  trace-heuristic  that  was  used  to  recover  low-rank  positive  semidefi- 
nite  matrices  [108].  Indeed,  several  papers  demonstrate  that  the  nuclear  norm  heuristic 
recovers  low-rank  matrices  in  various  rank  minimization  problems  [30, 121],  Based  on 
these  results,  we  propose  the  following  optimization  formulation  to  recover  A*  and  B * 
given  C  =  A*  +  B*: 

(A,  B)  =  argmin  7||A||i  +  ||f?||* 

(3.3) 

s.t.  A  +  B  =  C. 

Here  7  is  a  parameter  that  provides  a  trade-off  between  the  low-rank  and  sparse  compo¬ 
nents.  This  optimization  problem  is  convex,  and  can  in  fact  be  rewritten  as  a  semidef- 
inite  program  (SDP)  [139]  (see  Appendix  A.l). 

We  prove  that  (A,  B )  =  (A*,  B*)  is  the  unique  optimum  of  (3.3)  for  a  range  of  7  if 
fi(A*)£(B*)  <  |  (see  Theorem  3.4.1  in  Section  3.4.2).  Thus,  the  conditions  for  exact 
recovery  of  the  sparse  and  low-rank  components  via  the  convex  program  (3.3)  involve  the 
tangent-space-based  quantities  defined  in  (3.1)  and  (3.2).  Essentially  these  conditions 
specify  that  each  element  of  fi(A*)  must  have  a  diffuse  spectrum,  and  every  element 
of  T(B *)  must  be  diffuse.  In  a  sense  that  will  be  made  precise  later,  the  condition 
/i(A*)£(B*)  <  |  required  for  the  convex  program  (3.3)  to  provide  exact  recovery  is 
slightly  tighter  than  that  required  for  fundamental  identifiability  in  the  decomposition 
problem.  An  important  feature  of  our  result  is  that  it  provides  a  simple  deterministic 
condition  for  exact  recovery.  In  addition,  note  that  the  conditions  only  depend  on  the 
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row/column  spaces  of  the  low-rank  matrix  B*  and  the  support  of  the  sparse  matrix  A *, 
and  not  the  magnitudes  of  the  nonzero  singular  values  of  B*  or  the  nonzero  entries  of 
A*.  The  reason  for  this  is  that  the  magnitudes  of  the  nonzero  entries  of  A*  and  the 
nonzero  singular  values  of  B *  play  no  role  in  the  subgradient  conditions  with  respect 
to  the  l\  norm  and  the  nuclear  norm. 

In  the  sequel  we  discuss  concrete  classes  of  sparse  and  low-rank  matrices  that  have 
small  p  and  £  respectively.  We  also  show  that  when  the  sparse  and  low-rank  matrices  A* 
and  B *  are  drawn  from  certain  natural  random  ensembles,  then  the  sufficient  conditions 
of  Theorem  3.4.1  are  satisfied  with  high  probability;  consequently,  (3.3)  provides  exact 
recovery  with  high  probability  for  such  matrices. 

■  3.1.2  Previous  work  using  incoherence 

The  concept  of  incoherence  was  studied  in  the  context  of  recovering  sparse  represen¬ 
tations  of  vectors  from  a  so-called  “overcomplete  dictionary”  [52].  More  concretely 
consider  a  situation  in  which  one  is  given  a  vector  formed  by  a  sparse  linear  combina¬ 
tion  of  a  few  elements  from  a  combined  time- frequency  dictionary,  i.e.,  a  vector  formed 
by  adding  a  few  sinusoids  and  a  few  “spikes”;  the  goal  is  to  recover  the  spikes  and 
sinusoids  that  compose  the  vector  from  the  infinitely  many  possible  solutions.  Based 
on  a  notion  of  time-frequency  incoherence,  the  i\  heuristic  was  shown  to  succeed  in 
recovering  sparse  solutions  [51].  Incoherence  is  also  a  concept  that  is  used  in  recent 
work  under  the  title  of  compressed  sensing ,  which  aims  to  recover  “low-dimensional” 
objects  such  as  sparse  vectors  [29,54]  and  low-rank  matrices  [30, 121]  given  incomplete 
observations.  Our  work  is  closer  in  spirit  to  that  in  [52],  and  can  be  viewed  as  a  method 
to  recover  the  “simplest  explanation”  of  a  matrix  given  an  “overcomplete  dictionary” 
of  sparse  and  low-rank  matrix  atoms. 

■  3.1.3  Outline 

In  Section  3.2  we  elaborate  on  the  applications  mentioned  previously,  and  discuss  the 
implications  of  our  results  for  each  of  these  applications.  Section  3.3  formally  describes 
conditions  for  fundamental  identifiability  in  the  decomposition  problem  based  on  the 
quantities  £  and  p  defined  in  (3.1)  and  (3.2).  We  also  provide  a  proof  of  the  rank-sparsity 
uncertainty  principle  of  Theorem  3.3.1.  We  prove  Theorem  3.4.1  in  Section  3.4,  and 
also  provide  concrete  classes  of  sparse  and  low-rank  matrices  that  satisfy  the  sufficient 
conditions  of  Theorem  3.4.1.  Section  3.5  describes  the  results  of  simulations  of  our 
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approach  applied  to  synthetic  matrix  decomposition  problems.  We  conclude  with  a 
discussion  in  Section  3.6.  Appendix  A  provides  additional  details  and  proofs. 

■  3.2  Applications 

In  this  section  we  describe  several  applications  that  involve  decomposing  a  matrix  into 
sparse  and  low-rank  components. 

■  3.2.1  Graphical  modeling  with  latent  variables 

We  begin  with  a  problem  in  statistical  model  selection.  In  many  applications  large 
covariance  matrices  are  approximated  as  low-rank  matrices  based  on  the  assumption 
that  a  small  number  of  latent  factors  explain  most  of  the  observed  statistics  (e.g., 
principal  component  analysis) .  Another  well-studied  class  of  models  are  those  described 
by  graphical  models  [93]  in  which  the  inverse  of  the  covariance  matrix  (also  called  the 
precision  or  concentration  or  information  matrix)  is  assumed  to  be  sparse  (typically 
this  sparsity  is  with  respect  to  some  graph).  Consequently,  a  natural  sparse-plus-low- 
rank  decomposition  problem  arises  in  latent-variable  graphical  model  selection,  which 
we  discuss  in  more  detail  in  Chapter  4. 

■  3.2.2  Matrix  rigidity 

The  rigidity  of  a  matrix  M,  denoted  by  -Rm(&)>  is  the  smallest  number  of  entries  that 
need  to  be  changed  in  order  to  reduce  the  rank  of  M  below  k.  Obtaining  bounds 
on  rigidity  has  a  number  of  implications  in  complexity  theory  [99],  such  as  the  trade¬ 
offs  between  size  and  depth  in  arithmetic  circuits.  However,  computing  the  rigidity 
of  a  matrix  is  intractable  in  general  [38,  101].  For  any  M  6  Rnxn  one  can  check 
that  RM{k)  <  (n  —  k )2  (this  follows  directly  from  a  Schur  complement  argument). 
Generically  every  M  e  Mnxn  is  very  rigid,  i.e.,  R,M(k)  =  ( n  —  k )2  [138],  although  special 
classes  of  matrices  may  be  less  rigid.  We  show  that  the  SDP  (3.3)  can  be  used  to 
compute  rigidity  for  certain  matrices  with  sufficiently  small  rigidity  (see  Section  3.4.4 
for  more  details).  Indeed,  this  convex  program  (3.3)  also  provides  a  certificate  of  the 
sparse  and  low-rank  components  that  form  such  low-rigidity  matrices;  that  is,  the  SDP 
(3.3)  not  only  enables  us  to  compute  the  rigidity  for  certain  matrices  but  additionally 
provides  the  changes  required  in  order  to  realize  a  matrix  of  lower  rank. 
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■  3.2.3  Composite  system  identification 

A  decomposition  problem  can  also  be  posed  in  the  system  identification  setting.  Linear 
time-invariant  (LTI)  systems  can  be  represented  by  Hankel  matrices,  where  the  matrix 
represents  the  input-output  relationship  of  the  system  [131].  Thus,  a  sparse  Hankel 
matrix  corresponds  to  an  LTI  system  with  a  sparse  impulse  response.  A  low-rank 
Hankel  matrix  corresponds  to  a  system  with  small  model  order,  and  provides  a  minimal 
realization  for  a  system  [65].  Given  an  LTI  system  H  as  follows 

H  =  HS  +  Hlr, 

where  Hs  is  sparse  and  Hir  is  low-rank,  obtaining  a  simple  description  of  H  requires 
decomposing  it  into  its  simpler  sparse  and  low-rank  components.  One  can  obtain  these 
components  by  solving  our  rank-sparsity  decomposition  problem.  Note  that  in  practice 
one  can  impose  in  (3.3)  the  additional  constraint  that  the  sparse  and  low-rank  matrices 
have  Hankel  structure. 

■  3.2.4  Partially  coherent  decomposition  in  optical  systems 

We  outline  an  optics  application  that  is  described  in  greater  detail  in  [63].  Optical 
imaging  systems  are  commonly  modeled  using  the  Hopkins  integral  [75],  which  gives  the 
output  intensity  at  a  point  as  a  function  of  the  input  transmission  via  a  quadratic  form. 
In  many  applications  the  operator  in  this  quadratic  form  can  be  well-approximated  by 
a  (finite)  positive  semi-definite  matrix.  Optical  systems  described  by  a  low-pass  filter 
are  called  coherent  imaging  systems,  and  the  corresponding  system  matrices  have  small 
rank.  For  systems  that  are  not  perfectly  coherent  various  methods  have  been  proposed 
to  find  an  optimal  coherent  decomposition  [115],  and  these  essentially  identify  the  best 
approximation  of  the  system  matrix  by  a  matrix  of  lower  rank.  At  the  other  end 
are  incoherent  optical  systems  that  allow  some  high  frequencies,  and  are  characterized 
by  system  matrices  that  are  diagonal.  As  most  real-world  imaging  systems  are  some 
combination  of  coherent  and  incoherent,  it  was  suggested  in  [63]  that  optical  systems 
are  better  described  by  a  sum  of  coherent  and  incoherent  systems  rather  than  by  the 
best  coherent  (i.e. ,  low-rank)  approximation  as  in  [115].  Thus,  decomposing  an  imaging 
system  into  coherent  and  incoherent  components  involves  splitting  the  optical  system 
matrix  into  low-rank  and  diagonal  components.  Identifying  these  simpler  components 
has  important  applications  in  tasks  such  as  optical  microlithography  [75,115]. 
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■  3.3  Rank-Sparsity  Incoherence 

Throughout  this  chapter,  we  restrict  ourselves  to  square  nxn  matrices  to  avoid  cluttered 
notation.  All  our  analysis  extends  to  rectangular  n\  x  n 2  matrices,  if  we  simply  replace 
n  by  max(ni,  712). 

■  3.3.1  Identifiability  issues 

As  described  in  the  introduction,  the  matrix  decomposition  problem  can  be  fundamen¬ 
tally  ill-posed.  We  describe  two  situations  in  which  identifiability  issues  arise.  These 
examples  suggest  the  kinds  of  additional  conditions  that  are  required  in  order  to  ensure 
that  there  exists  a  unique  decomposition  into  sparse  and  low-rank  matrices. 

First,  let  A *  be  any  sparse  matrix  and  let  B*  =  ejej,  where  e*  represents  the  z-tli 
standard  basis  vector.  In  this  case,  the  low-rank  matrix  B*  is  also  very  sparse,  and  a 
valid  sparse-plus-low-rank  decomposition  might  be  A  =  A*  +  e^ej  and  B  =  0.  Thus, 
we  need  conditions  that  ensure  that  the  low-rank  matrix  is  not  too  sparse.  One  way 
to  accomplish  this  is  to  require  that  the  quantity  £(2?*)  be  small.  As  will  be  discussed 
in  Section  3.4.3),  if  the  row  and  column  spaces  of  B *  are  “incoherent”  with  respect  to 
the  standard  basis,  i.e.,  the  row/column  spaces  are  not  aligned  closely  with  any  of  the 
coordinate  axes,  then  £(!?*)  is  small. 

Next,  consider  the  scenario  in  which  B*  is  any  low-rank  matrix  and  A*  =  —vef 
with  v  being  the  first  column  of  B*.  Thus,  C  =  A*  +  B*  has  zeros  in  the  first  column, 
rank(C')  <  rank(R*),  and  C  has  the  same  column  space  as  B*.  Therefore,  a  reasonable 
sparse-plus-low-rank  decomposition  in  this  case  might  be  B  =  B *  +  A*  and  A  =  0. 
Here  rank(H)  =  rank(R*).  Requiring  that  a  sparse  matrix  A *  have  small  /i(A*)  avoids 
such  identifiability  issues.  Indeed  we  show  in  Section  3.4.3  that  sparse  matrices  with 
“bounded  degree”  (i.e.,  few  nonzero  entries  per  row/column)  have  small  [i. 

■  3.3.2  Tangent-space  identifiability 

We  begin  by  describing  the  sets  of  sparse  and  low-rank  matrices.  These  sets  can  be 
considered  either  as  differentiable  manifolds  (away  from  their  singularities)  or  as  alge¬ 
braic  varieties;  we  emphasize  the  latter  viewpoint  here.  Recall  that  an  algebraic  variety 
is  the  solution  set  of  a  system  of  polynomial  equations.  The  set  of  sparse  matrices  and 
the  set  of  low-rank  matrices  can  be  naturally  viewed  as  algebraic  varieties.  Here  we 
describe  these  varieties,  and  discuss  some  of  their  properties.  Of  particular  interest  in 
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this  chapter  are  geometric  properties  of  these  varieties  such  as  the  tangent  space  at  a 
(smooth)  point. 

Let  S(k)  denote  the  set  of  matrices  with  at  most  k  nonzeros: 

S(k)  =  {M  G  Mnxn  |  | support (M) |  <  k}.  (3.4) 

The  set  S(k)  is  an  algebraic  variety,  and  can  in  fact  be  viewed  as  a  union  of  (”, ) 
subspaces  in  Mnxn.  This  variety  has  dimension  k,  and  it  is  smooth  everywhere  except  at 
those  matrices  that  have  support  size  strictly  smaller  than  k.  For  any  matrix  M  G  Rnxn, 
consider  the  variety  <S(|support(M)|);  M  is  a  smooth  point  of  this  variety,  and  the 
tangent  space  at  M  is  given  by 

n(Af)  =  {N  G  Mnxn  I  support  (A/")  C  support(M)}.  (3.5) 

In  words  the  tangent  space  f i(M)  at  a  smooth  point  M  is  given  by  the  set  of  all  matrices 
that  have  support  contained  within  the  support  of  M.  We  view  Q ( M )  as  a  subspace  in 

jjjnxn 

Next  let  C(r)  denote  the  algebraic  variety  of  matrices  with  rank  at  most  r: 

C{r)  =  {M  G  Rnxn  |  rank(M)  <  r}.  (3.6) 

It  is  easily  seen  that  C(r)  is  an  algebraic  variety  because  it  can  be  defined  through  the 
vanishing  of  all  [r  +  1)  x  (r  +  1)  minors.  This  variety  has  dimension  equal  to  r(2n  —  r), 
and  it  is  smooth  everywhere  except  at  those  matrices  that  have  rank  strictly  smaller 
than  r.  Consider  a  rank-r  matrix  M  with  SVD  M  =  UDVT,  where  E/,b  G  Wixr  and 
D  G  Mrxr.  The  matrix  M  is  a  smooth  point  of  the  variety  £(rank(Af)),  and  the  tangent 
space  at  M  with  respect  to  this  variety  is  given  by 

T(Af)  =  {UY?  +  Y2Vt  |  Y1,Y2  G  Mnxr}.  (3.7) 

In  words  the  tangent  space  T(M)  at  a  smooth  point  M  is  the  span  of  all  matrices  that 
have  either  the  same  row-space  as  M  or  the  same  column-space  as  M.  As  with  Q(M) 
we  view  T(M )  as  a  subspace  in  Rnxn. 

Before  analyzing  whether  ( A*,B *)  can  be  recovered  in  general  (for  example,  using 
the  SDP  (3.3)),  we  ask  a  simpler  question.  Suppose  that  we  had  prior  information 
about  the  tangent  spaces  fl(A*)  and  T(B *),  in  addition  to  being  given  C  =  A*  +  B*. 
Can  we  then  uniquely  recover  ( A*,B *)  from  Cl  Assuming  such  prior  knowledge  of 
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the  tangent  spaces  is  unrealistic  in  practice;  however,  we  obtain  useful  insight  into  the 
kinds  of  conditions  required  on  sparse  and  low-rank  matrices  for  exact  decomposition. 
A  necessary  and  sufficient  condition  for  unique  identifiability  of  (A*,B*)  with  respect 
to  the  tangent  spaces  J1(A*)  and  T(B *)  is  that  these  spaces  intersect  transversally: 

n(A*)  n  t(b*)  =  {o}. 

That  is,  the  subspaces  fi(A*)  and  T(B*)  have  a  trivial  intersection.  The  sufficiency  of 
this  condition  for  unique  decomposition  is  easily  seen.  For  the  necessity  part,  suppose 
for  the  sake  of  a  contradiction  that  a  nonzero  matrix  M  belongs  to  f2(A*)  n  T(B*); 
one  can  add  and  subtract  M  from  A*  and  B *  respectively  while  still  having  a  valid 
decomposition,  which  violates  the  uniqueness  requirement.  Therefore  tangent  space 
transversality  is  equivalent  to  a  “linearized”  identifiability  condition  around  (A*,B*). 
Note  that  tangent  space  transversality  is  also  a  sufficient  condition  for  local  identifia¬ 
bility  around  ( A*,B *)  with  respect  to  the  sparse  and  low-rank  matrix  varieties,  based 
on  the  inverse  function  theorem.  The  transversality  condition  does  not,  however,  im¬ 
ply  global  identifiability  with  respect  to  the  sparse  and  low-rank  matrix  varieties.  The 
following  proposition,  proved  in  Appendix  A. 2,  provides  a  simple  condition  in  terms  of 
the  quantities  ffiA*)  and  £(-£>*)  for  the  tangent  spaces  f2(A*)  and  T(B *)  to  intersect 
transversally. 

Proposition  3.3.1.  Given  any  two  matrices  A*  and  B*,  we  have  that 

ffiA*ffi(B*)  <  1  =>  fi(A*)  n  T(B*)  =  {0}, 

where  £(B*)  and  ffiA*)  are  defined  in  (3.1)  and  (3.2),  and  the  tangent  spaces  J2(A*) 
and  T(B *)  are  defined  in  (3.5)  and  (3.7). 

Thus,  both  n(A*)  and  ^(B *)  being  small  implies  that  the  tangent  spaces  f2(A*)  and 
T(B *)  intersect  transversally;  consequently,  we  can  exactly  recover  (A*,  B*)  given 
and  T(B *).  As  we  shall  see,  the  condition  required  in  Theorem  3.4.1  (see  Section  3.4.2) 
for  exact  recovery  using  the  convex  program  (3.3)  will  be  simply  a  mild  tightening  of 
the  condition  required  above  for  unique  decomposition  given  the  tangent  spaces. 

■  3.3.3  Rank-sparsity  uncertainty  principle 

Another  important  consequence  of  Proposition  3.3.1  is  that  we  have  an  elementary 
proof  of  the  following  rank-sparsity  uncertainty  principle. 


Sec.  3.4.  Exact  Decomposition  Using  Semidefinite  Programming 


39 


Theorem  3.3.1.  For  any  matrix  M  0,  we  have  that 

£(M)/x(M)  >  1, 

where  £(M)  and  n(M)  are  as  defined  in  (3.1)  and  (3.2)  respectively. 

Proof:  Given  any  M  /  0  it  is  clear  that  M  G  Q(M)  n  T(M),  i.e. ,  M  is  an  element 
of  both  tangent  spaces.  However  //(M)£(M)  <  1  would  imply  from  Proposition  3.3.1 
that  H(M)  n  T(M)  =  {0},  which  is  a  contradiction.  Consequently,  we  must  have  that 
H(M)£(M)  >  1.  □ 

Hence, for any matrix $M \neq 0$ both $\mu(M)$ and $\xi(M)$ cannot be simultaneously
small. Note that Proposition 3.3.1 is an assertion involving $\mu$ and $\xi$ for (in general)
different matrices, while Theorem 3.3.1 is a statement about $\mu$ and $\xi$ for the same
matrix. Essentially the uncertainty principle asserts that no matrix can be too sparse
while having “diffuse” row and column spaces. An extreme example is the matrix $e_i e_j^T$,
which has the property that $\mu(e_i e_j^T)\,\xi(e_i e_j^T) = 1$.
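
The extreme case can be worked out by hand. The sketch below assumes that (3.1) and (3.2), which lie outside this section, define $\xi(M)$ as the largest entrywise magnitude over elements of $T(M)$ with spectral norm at most one, and $\mu(M)$ as the largest spectral norm over elements of $\Omega(M)$ with entries bounded by one in magnitude; if the definitions differ, the computation should be adjusted accordingly.

```latex
% Assumed definitions (restated from (3.1) and (3.2)):
%   \xi(M) = \max_{N \in T(M),\ \|N\| \le 1} \|N\|_\infty ,
%   \mu(M) = \max_{N \in \Omega(M),\ \|N\|_\infty \le 1} \|N\| .
\begin{align*}
\Omega(e_i e_j^T) &= \{ c\, e_i e_j^T : c \in \mathbb{R} \}
  &&\Rightarrow\quad \mu(e_i e_j^T) = \max_{|c| \le 1} \| c\, e_i e_j^T \| = 1, \\
\|N\|_\infty &\le \|N\| \le 1 \ \text{ for all admissible } N \in T(e_i e_j^T)
  &&\Rightarrow\quad \xi(e_i e_j^T) \le 1, \\
N &= e_i e_j^T \in T(e_i e_j^T),\ \|N\| = \|N\|_\infty = 1
  &&\Rightarrow\quad \xi(e_i e_j^T) = 1 .
\end{align*}
% Hence \mu(e_i e_j^T)\,\xi(e_i e_j^T) = 1, so the bound of Theorem 3.3.1 is tight.
```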

■  3.4  Exact  Decomposition  Using  Semidefinite  Programming 

We  begin  this  section  by  studying  the  optimality  conditions  of  the  convex  program  (3.3), 
after  which  we  provide  a  proof  of  Theorem  3.4.1  with  simple  conditions  that  guarantee 
exact  decomposition.  Next  we  discuss  concrete  classes  of  sparse  and  low-rank  matrices 
that  satisfy  the  conditions  of  Theorem  3.4.1,  and  can  thus  be  uniquely  decomposed 
using  (3.3). 

■  3.4.1  Optimality  conditions 

The orthogonal projection onto the space $\Omega(A^*)$ is denoted $P_{\Omega(A^*)}$, which simply sets
to zero those entries with support not inside support$(A^*)$. The subspace orthogonal to
$\Omega(A^*)$ is denoted $\Omega(A^*)^c$, and it consists of matrices with complementary support, i.e.,
supported on support$(A^*)^c$. The projection onto $\Omega(A^*)^c$ is denoted $P_{\Omega(A^*)^c}$.

Similarly the orthogonal projection onto the space $T(B^*)$ is denoted $P_{T(B^*)}$. Letting
$B^* = U \Sigma V^T$ be the SVD of $B^*$, we have the following explicit relation for $P_{T(B^*)}$:

$$P_{T(B^*)}(M) = P_U M + M P_V - P_U M P_V. \tag{3.8}$$

Here $P_U = UU^T$ and $P_V = VV^T$. The space orthogonal to $T(B^*)$ is denoted $T(B^*)^\perp$,
and the corresponding projection is denoted $P_{T(B^*)^\perp}(M)$. The space $T(B^*)^\perp$ consists of
matrices with row-space orthogonal to the row-space of $B^*$ and column-space orthogonal
to the column-space of $B^*$. We have that

$$P_{T(B^*)^\perp}(M) = (I_{n \times n} - P_U)\, M\, (I_{n \times n} - P_V), \tag{3.9}$$

where $I_{n \times n}$ is the $n \times n$ identity matrix.
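
The four projections above translate directly into a few lines of NumPy. The following is a minimal sketch rather than the implementation used in this thesis: the function names and the numerical-rank tolerance `tol` are assumptions introduced for illustration, and the matrices are taken to be square.

```python
import numpy as np

def tangent_projectors(B_star, tol=1e-10):
    """Return (P_T, P_T_perp) as callables built from the SVD of B_star."""
    U, s, Vt = np.linalg.svd(B_star)
    r = int(np.sum(s > tol * s[0]))      # numerical rank; `tol` is an assumed cutoff
    U, V = U[:, :r], Vt[:r, :].T
    P_U, P_V = U @ U.T, V @ V.T
    I = np.eye(B_star.shape[0])

    def P_T(M):                          # equation (3.8)
        return P_U @ M + M @ P_V - P_U @ M @ P_V

    def P_T_perp(M):                     # equation (3.9)
        return (I - P_U) @ M @ (I - P_V)

    return P_T, P_T_perp

def P_Omega(M, A_star):
    """Keep the entries of M on support(A_star), zero the rest."""
    return np.where(A_star != 0, M, 0.0)

def P_Omega_c(M, A_star):
    """Keep the entries of M off support(A_star), zero the rest."""
    return np.where(A_star != 0, 0.0, M)
```

Each pair of projections sums to the identity map, which gives a quick sanity check on any implementation.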

Following standard notation in convex analysis [124], we denote the subdifferential
of a convex function $f$ at a point $\hat{x}$ in its domain by $\partial f(\hat{x})$. The subdifferential $\partial f(\hat{x})$
consists of all $y$ such that

$$f(x) \geq f(\hat{x}) + \langle y, x - \hat{x} \rangle, \quad \forall x.$$

From the optimality conditions for a convex program [13], we have that $(A^*, B^*)$ is an
optimum of (3.3) if and only if there exists a dual $Q \in \mathbb{R}^{n \times n}$ such that

$$Q \in \gamma\, \partial \|A^*\|_1 \quad \text{and} \quad Q \in \partial \|B^*\|_*. \tag{3.10}$$

From the characterization of the subdifferential of the $\ell_1$ norm, we have that
$Q \in \gamma\, \partial \|A^*\|_1$ if and only if

$$P_{\Omega(A^*)}(Q) = \gamma\, \mathrm{sign}(A^*), \qquad \|P_{\Omega(A^*)^c}(Q)\|_\infty \leq \gamma. \tag{3.11}$$

Here $\mathrm{sign}(A^*_{i,j})$ equals $+1$ if $A^*_{i,j} > 0$, $-1$ if $A^*_{i,j} < 0$, and $0$ if $A^*_{i,j} = 0$. We also have
that $Q \in \partial \|B^*\|_*$ if and only if [142]

$$P_{T(B^*)}(Q) = UV^T, \qquad \|P_{T(B^*)^\perp}(Q)\| \leq 1. \tag{3.12}$$

Note that these are necessary and sufficient conditions for $(A^*, B^*)$ to be an optimum
of (3.3). The following proposition provides sufficient conditions for $(A^*, B^*)$ to be the
unique optimum of (3.3), and it involves a slight tightening of the conditions (3.10),
(3.11), and (3.12).

Proposition 3.4.1. Suppose that $C = A^* + B^*$. Then $(\hat{A}, \hat{B}) = (A^*, B^*)$ is the unique
optimizer of (3.3) if the following conditions are satisfied:

1. $\Omega(A^*) \cap T(B^*) = \{0\}$.

2. There exists a dual $Q \in \mathbb{R}^{n \times n}$ such that

(a) $P_{T(B^*)}(Q) = UV^T$

(b) $P_{\Omega(A^*)}(Q) = \gamma\, \mathrm{sign}(A^*)$

(c) $\|P_{T(B^*)^\perp}(Q)\| < 1$

(d) $\|P_{\Omega(A^*)^c}(Q)\|_\infty < \gamma$

Figure 3.1. Geometric representation of optimality conditions: Existence of a dual $Q$. The arrows
denote orthogonal projections; every projection must satisfy a condition (according to Proposition 3.4.1),
which is described next to each arrow.

The proof of the proposition can be found in Appendix A.2. Figure 3.1 provides a
visual representation of these conditions. In particular, we see that the spaces $\Omega(A^*)$
and $T(B^*)$ intersect transversely (part (1) of Proposition 3.4.1). One can also intuitively
see that guaranteeing the existence of a dual $Q$ with the requisite conditions (part (2) of
Proposition 3.4.1) is perhaps easier if the intersection between $\Omega(A^*)$ and $T(B^*)$ is more
transverse. Note that condition (1) of this proposition essentially requires identifiability
with respect to the tangent spaces, as discussed in Section 3.3.2.

■  3.4.2  Sufficient conditions based on $\mu(A^*)$ and $\xi(B^*)$

Next we provide simple sufficient conditions on $A^*$ and $B^*$ that guarantee the existence
of an appropriate dual $Q$ (as required by Proposition 3.4.1). Given matrices $A^*$ and
$B^*$ with $\mu(A^*)\xi(B^*) < 1$, we have from Proposition 3.3.1 that $\Omega(A^*) \cap T(B^*) = \{0\}$,
i.e., condition (1) of Proposition 3.4.1 is satisfied. We prove that if a slightly stronger
condition holds, there exists a dual $Q$ that satisfies the requirements of condition (2) of
Proposition 3.4.1.

Theorem 3.4.1. Given $C = A^* + B^*$ with

$$\mu(A^*)\,\xi(B^*) < \frac{1}{6},$$

the unique optimum $(\hat{A}, \hat{B})$ of (3.3) is $(A^*, B^*)$ for the following range of $\gamma$:

$$\gamma \in \left( \frac{\xi(B^*)}{1 - 4\mu(A^*)\xi(B^*)},\; \frac{1 - 3\mu(A^*)\xi(B^*)}{\mu(A^*)} \right).$$

Specifically $\gamma = \frac{(3\xi(B^*))^p}{(2\mu(A^*))^{1-p}}$ for any choice of $p \in [0, 1]$ is always inside the above range,
and thus guarantees exact recovery of $(A^*, B^*)$. For example $\gamma = \sqrt{\frac{3\xi(B^*)}{2\mu(A^*)}}$ always
guarantees exact recovery of $(A^*, B^*)$.
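
The claim that the family $\gamma = (3\xi(B^*))^p / (2\mu(A^*))^{1-p}$ lies inside the stated range can be checked in a few lines. The short argument below is our own verification, not taken from the proof in Appendix A.2; it writes $\mu = \mu(A^*)$, $\xi = \xi(B^*)$ and assumes both are positive.

```latex
% Let x = \mu\xi < 1/6.
\begin{align*}
\frac{\xi}{1 - 4x} &< \frac{\xi}{1 - 4/6} = 3\xi,
&
\frac{1 - 3x}{\mu} &> \frac{1 - 3/6}{\mu} = \frac{1}{2\mu},
&
3\xi < \frac{1}{2\mu} &\iff \mu\xi < \frac{1}{6}.
\end{align*}
% The choice \gamma_p = (3\xi)^p \left(\tfrac{1}{2\mu}\right)^{1-p} is a weighted geometric
% mean of 3\xi and 1/(2\mu), so for every p \in [0,1]:
\[
\frac{\xi}{1 - 4\mu\xi} \;<\; 3\xi \;\le\; \gamma_p \;\le\; \frac{1}{2\mu} \;<\; \frac{1 - 3\mu\xi}{\mu}.
\]
% Taking p = 1/2 gives \gamma = \sqrt{3\xi / (2\mu)}.
```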

Recall from the discussion in Section 3.3.2 and from Proposition 3.3.1 that $\mu(A^*)\xi(B^*) < 1$
is sufficient to ensure that the tangent spaces $\Omega(A^*)$ and $T(B^*)$ have a transverse
intersection, which implies that $(A^*, B^*)$ are locally identifiable and can be recovered given
$C = A^* + B^*$ along with side information about the tangent spaces $\Omega(A^*)$ and $T(B^*)$.
Theorem 3.4.1 asserts that if $\mu(A^*)\xi(B^*) < \frac{1}{6}$, i.e., if the tangent spaces $\Omega(A^*)$ and
$T(B^*)$ are sufficiently transverse, then the SDP (3.3) succeeds in recovering $(A^*, B^*)$
without any information about the tangent spaces.

The proof of this theorem can be found in Appendix A.2. The main idea behind
the proof is that we only consider candidates for the dual $Q$ that lie in the direct sum
$\Omega(A^*) \oplus T(B^*)$ of the tangent spaces. Since $\mu(A^*)\xi(B^*) < \frac{1}{6}$, we have from
Proposition 3.3.1 that the tangent spaces $\Omega(A^*)$ and $T(B^*)$ have a transverse intersection, i.e.,
$\Omega(A^*) \cap T(B^*) = \{0\}$. Therefore, there exists a unique element $Q \in \Omega(A^*) \oplus T(B^*)$
that satisfies $P_{T(B^*)}(Q) = UV^T$ and $P_{\Omega(A^*)}(Q) = \gamma\, \mathrm{sign}(A^*)$. The proof proceeds by
showing that if $\mu(A^*)\xi(B^*) < \frac{1}{6}$ then the projections of this $Q$ onto the orthogonal
spaces $\Omega(A^*)^c$ and $T(B^*)^\perp$ are small, thus satisfying condition (2) of Proposition 3.4.1.
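
The candidate dual in $\Omega(A^*) \oplus T(B^*)$ can be computed numerically for small examples. The sketch below is an illustration and not the construction used in Appendix A.2: it splits $Q = Q_\Omega + Q_T$ and alternates between the two equality constraints, which converges when the tangent spaces intersect only at the origin. The function name, the iteration count, and the rank tolerance are assumptions.

```python
import numpy as np

def candidate_dual(A_star, B_star, gamma, iters=200, tol=1e-10):
    """Approximate the Q in Omega(A*) + T(B*) with P_T(Q) = U V^T and
    P_Omega(Q) = gamma * sign(A*), and report the quantities in (c) and (d)."""
    U, s, Vt = np.linalg.svd(B_star)
    r = int(np.sum(s > tol * s[0]))          # numerical rank (assumed cutoff)
    U, V = U[:, :r], Vt[:r, :].T
    P_U, P_V = U @ U.T, V @ V.T
    I = np.eye(B_star.shape[0])
    mask = A_star != 0

    def P_T(M):
        return P_U @ M + M @ P_V - P_U @ M @ P_V

    def P_Om(M):
        return np.where(mask, M, 0.0)

    UVt = U @ V.T
    S = gamma * np.sign(A_star)
    Q_T = UVt.copy()
    for _ in range(iters):
        Q_Om = S - P_Om(Q_T)                 # enforce P_Omega(Q) = gamma * sign(A*)
        Q_T = UVt - P_T(Q_Om)                # enforce P_T(Q) = U V^T
    Q = Q_Om + Q_T

    report = {
        # left-hand side of condition (c): spectral norm of P_{T^perp}(Q)
        "spectral_norm_PTperp": float(np.linalg.norm((I - P_U) @ Q @ (I - P_V), 2)),
        # left-hand side of condition (d): largest magnitude of Q off the support
        "max_abs_POmega_c": float(np.max(np.abs(np.where(mask, 0.0, Q)))),
    }
    return Q, report
```

Comparing `spectral_norm_PTperp` with 1 and `max_abs_POmega_c` with `gamma` then indicates whether this particular $Q$ certifies uniqueness via Proposition 3.4.1; failure of the check does not by itself rule out exact recovery.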

Remarks We discuss here the manner in which our results are to be interpreted. Given
a matrix $C = A^* + B^*$ with $A^*$ sparse and $B^*$ low-rank, there are a number of alternative
decompositions of $C$ into “sparse” and “low-rank” components. For example, one could
subtract one of the nonzero entries from the matrix $A^*$ and add it to $B^*$; thus, the
sparsity level of $A^*$ is strictly improved, while the rank of the modified $B^*$ increases by
at most 1. In fact one could construct many such alternative decompositions. Therefore,
it may a priori be unclear which of these many decompositions is the “correct” one.
To clarify this issue consider a matrix $C = A^* + B^*$ that is composed of the sum
of a sparse $A^*$ with small $\mu(A^*)$ and a low-rank $B^*$ with small $\xi(B^*)$. Recall that a
sparse matrix having a small $\mu$ implies that the sparsity pattern of the matrix is
“diffuse,” i.e., no row/column contains too many non-zeros (see Proposition 3.4.2 in
Section 3.4.3 for a precise characterization). Similarly, a low-rank matrix with small $\xi$
has “diffuse” row/column spaces, i.e., the row/column spaces are not aligned with any of
the coordinate axes and as a result do not contain sparse vectors (see Proposition 3.4.3
in Section 3.4.3 for a precise characterization). Now let $C = A + B$ be an alternative
decomposition with some of the entries of $A^*$ moved to $B^*$. Although the new $A$ has
a smaller support contained strictly within the support of $A^*$ (and consequently, a
smaller $\mu(A)$), the new low-rank matrix $B$ has sparse vectors in its row and column
spaces. Consequently we have that $\xi(B) \gg \xi(B^*)$. Thus, while $(A, B)$ is also a
sparse-plus-low-rank decomposition, it is not a diffuse sparse-plus-low-rank decomposition, in
that both the sparse matrix $A$ and the low-rank matrix $B$ do not simultaneously have
diffuse supports and row/column spaces respectively. Also the opposite situation of
removing a rank-1 term from the SVD of the low-rank matrix $B^*$ and moving it to
$A^*$ to form a new decomposition $(A, B)$ (now with $B$ having strictly smaller rank than
$B^*$) faces a similar problem. In this case $B$ has strictly smaller rank than $B^*$, and also
by construction a smaller $\xi(B)$. However the original low-rank matrix $B^*$ has a small
$\xi(B^*)$ and thus has diffuse row/column spaces; therefore the rank-1 term that is added
to $A^*$ will not be sparse, and consequently the new matrix $A$ will have $\mu(A) \gg \mu(A^*)$.
Hence the key point is that these alternate decompositions $(A, B)$ do not satisfy the
property that $\mu(A)\xi(B) < \frac{1}{6}$. Thus, our result is to be interpreted as follows: Given
a matrix $C = A^* + B^*$ formed by adding a sparse matrix $A^*$ with diffuse support
and a low-rank matrix $B^*$ with diffuse row/column spaces, the convex program that is
studied in this chapter will recover this diffuse decomposition over the many possible
alternative decompositions into sparse and low-rank components as none of these have
the property of both components being simultaneously diffuse. Indeed in applications
such as graphical model selection (see Section 3.2.1) it is precisely such a “diffuse”
decomposition that one seeks to recover.

A related question is: given a decomposition $C = A^* + B^*$ with $\mu(A^*)\xi(B^*) < \frac{1}{6}$,
do there exist small, local perturbations of $A^*$ and $B^*$ that give rise to alternate
decompositions $(A, B)$ with $\mu(A)\xi(B) < \frac{1}{6}$? Suppose $B^*$ is slightly perturbed along the
variety of rank-constrained matrices to some $B$. This ensures that the tangent space
varies smoothly from $T(B^*)$ to $T(B)$, and consequently that $\xi(B) \approx \xi(B^*)$. However,
compensating for this by changing $A^*$ to $A^* + (B^* - B)$ moves $A^*$ outside the variety of
sparse matrices. This is because $B^* - B$ is not sparse. Thus the dimension of the tangent
space $\Omega(A^* + B^* - B)$ is much greater than that of the tangent space $\Omega(A^*)$, as a result
of which $\mu(A^* + B^* - B) \gg \mu(A^*)$; therefore we have that $\xi(B)\mu(A^* + B^* - B) \gg \frac{1}{6}$.
The same reasoning holds in the opposite scenario. Consider perturbing $A^*$ slightly
along the variety of sparse matrices to some $A$. While this ensures that $\mu(A) \approx \mu(A^*)$,
changing $B^*$ to $B^* + (A^* - A)$ moves $B^*$ outside the variety of rank-constrained matrices.
Therefore the dimension of the tangent space $T(B^* + A^* - A)$ is much greater than that of
$T(B^*)$, and also $T(B^* + A^* - A)$ contains sparse matrices, resulting in $\xi(B^* + A^* - A) \gg \xi(B^*)$;
consequently we have that $\mu(A)\xi(B^* + A^* - A) \gg \frac{1}{6}$.

■  3.4.3  Sparse and low-rank matrices with $\mu(A^*)\xi(B^*) < \frac{1}{6}$

We discuss concrete classes of sparse and low-rank matrices that satisfy the sufficient
condition of Theorem 3.4.1 for exact decomposition. We begin by showing that sparse
matrices with “bounded degree”, i.e., bounded number of nonzeros per row/column,
have small $\mu$.

Proposition 3.4.2. Let $A \in \mathbb{R}^{n \times n}$ be any matrix with at most $\deg_{\max}(A)$ nonzero
entries per row/column, and with at least $\deg_{\min}(A)$ nonzero entries per row/column.
With $\mu(A)$ as defined in (3.2), we have that

$$\deg_{\min}(A) \leq \mu(A) \leq \deg_{\max}(A).$$

See Appendix A.2 for the proof. Note that if $A \in \mathbb{R}^{n \times n}$ has full support, i.e.,
$\Omega(A) = \mathbb{R}^{n \times n}$, then $\mu(A) = n$. Therefore, a constraint on the number of zeros per
row/column provides a useful bound on $\mu$. We emphasize here that simply bounding
the number of nonzero entries in $A$ does not suffice; the sparsity pattern also plays a
role in determining the value of $\mu$.

Next we consider low-rank matrices that have small $\xi$. Specifically, we show that
matrices with row and column spaces that are incoherent with respect to the standard
basis have small $\xi$. We measure the incoherence of a subspace $S \subseteq \mathbb{R}^n$ as follows:

$$\beta(S) \triangleq \max_i \|P_S e_i\|_2, \tag{3.13}$$

where $e_i$ is the $i$'th standard basis vector, $P_S$ denotes the projection onto the subspace
$S$, and $\|\cdot\|_2$ denotes the vector $\ell_2$ norm. This definition of incoherence also played an
important role in the results in [30]. A small value of $\beta(S)$ implies that the subspace $S$
is not closely aligned with any of the coordinate axes. In general for any $k$-dimensional
subspace $S$, we have that

$$\sqrt{\frac{k}{n}} \leq \beta(S) \leq 1,$$

where the lower bound is achieved, for example, by a subspace that spans any $k$ columns
of an $n \times n$ orthonormal Hadamard matrix, while the upper bound is achieved by any
subspace that contains a standard basis vector. Based on the definition of $\beta(S)$, we
define the incoherence of the row/column spaces of a matrix $B \in \mathbb{R}^{n \times n}$ as

$$\mathrm{inc}(B) \triangleq \max\{\beta(\text{row-space}(B)),\; \beta(\text{column-space}(B))\}. \tag{3.14}$$

If the SVD of $B$ is $B = U \Sigma V^T$ then row-space$(B)$ = span$(V)$ and column-space$(B)$ =
span$(U)$. We show in Appendix A.2 that matrices with incoherent row/column spaces
have small $\xi$; the proof technique for the lower bound here was suggested by Ben
Recht [120].

Proposition 3.4.3. Let $B \in \mathbb{R}^{n \times n}$ be any matrix with $\mathrm{inc}(B)$ defined as in (3.14), and
$\xi(B)$ defined as in (3.1). We have that

$$\mathrm{inc}(B) \leq \xi(B) \leq 2\, \mathrm{inc}(B).$$

If $B \in \mathbb{R}^{n \times n}$ is a full-rank matrix or a matrix such as $e_1 e_1^T$, then $\xi(B) = 1$. Therefore,
a bound on the incoherence of the row/column spaces of $B$ is important in order to
bound $\xi$. Using Propositions 3.4.2 and 3.4.3 along with Theorem 3.4.1 we have the
following corollary, which states that sparse bounded-degree matrices and low-rank
matrices with incoherent row/column spaces can be uniquely decomposed.
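
Unlike $\xi$, the incoherence in (3.13) and (3.14) is straightforward to compute. The following is a minimal NumPy sketch; the function names and the rank tolerance are assumptions made for illustration. It uses the fact that, for an orthonormal basis matrix of $S$, $\|P_S e_i\|_2$ equals the Euclidean norm of the $i$'th row of that basis matrix.

```python
import numpy as np

def beta(basis):
    """beta(S) from (3.13), for S spanned by the orthonormal columns of `basis`."""
    # ||P_S e_i||_2 = ||basis.T @ e_i||_2, i.e. the norm of row i of `basis`.
    return float(np.max(np.linalg.norm(basis, axis=1)))

def inc(B, tol=1e-10):
    """inc(B) from (3.14): the larger of the row-space and column-space incoherence."""
    U, s, Vt = np.linalg.svd(B)
    r = int(np.sum(s > tol * s[0]))          # numerical rank (assumed cutoff)
    row_space_basis = Vt[:r, :].T            # span(V)
    col_space_basis = U[:, :r]               # span(U)
    return max(beta(row_space_basis), beta(col_space_basis))
```

By Proposition 3.4.3, the returned value brackets $\xi(B)$ to within a factor of two.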

Corollary 3.4.1. Let $C = A^* + B^*$ with $\deg_{\max}(A^*)$ being the maximum number of
nonzero entries per row/column of $A^*$ and $\mathrm{inc}(B^*)$ being the maximum incoherence of
the row/column spaces of $B^*$ (as defined by (3.14)). If we have that

$$\deg_{\max}(A^*)\, \mathrm{inc}(B^*) < \frac{1}{12},$$

then the unique optimum of the convex program (3.3) is $(\hat{A}, \hat{B}) = (A^*, B^*)$ for a range
of values of $\gamma$:

$$\gamma \in \left( \frac{2\, \mathrm{inc}(B^*)}{1 - 8 \deg_{\max}(A^*)\, \mathrm{inc}(B^*)},\; \frac{1 - 6 \deg_{\max}(A^*)\, \mathrm{inc}(B^*)}{\deg_{\max}(A^*)} \right). \tag{3.15}$$

Specifically $\gamma = \frac{(6\, \mathrm{inc}(B^*))^p}{(2 \deg_{\max}(A^*))^{1-p}}$ for any choice of $p \in [0, 1]$ is always inside the above
range, and thus guarantees exact recovery of $(A^*, B^*)$.

We  emphasize  that  this  is  a  result  with  deterministic  sufficient  conditions  on  exact 
decomposability. 
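
Because both quantities in Corollary 3.4.1 are computable, the sufficient condition and the admissible range of $\gamma$ can be evaluated directly for a given pair of matrices. The helper names below are hypothetical, and the sketch assumes $A^*$ has at least one nonzero entry so that $\deg_{\max}(A^*) \geq 1$; the incoherence value is passed in as a number.

```python
import numpy as np

def deg_max(A):
    """Maximum number of nonzero entries in any row or column of A."""
    support = A != 0
    return int(max(support.sum(axis=0).max(), support.sum(axis=1).max()))

def corollary_gamma_range(d, inc_B):
    """Return the (lower, upper) range (3.15), or None if d * inc_B >= 1/12."""
    prod = d * inc_B
    if not prod < 1.0 / 12.0:
        return None                          # sufficient condition not met
    lower = 2.0 * inc_B / (1.0 - 8.0 * prod)
    upper = (1.0 - 6.0 * prod) / d
    return lower, upper

def gamma_choice(d, inc_B, p=0.5):
    """The explicit choice (6 inc)^p / (2 deg_max)^(1-p) for p in [0, 1]."""
    return (6.0 * inc_B) ** p / (2.0 * d) ** (1.0 - p)
```

A return value of `None` only means that the deterministic guarantee does not apply; the convex program may still succeed, as the simulations in Section 3.5 suggest.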

■  3.4.4  Decomposing  random  sparse  and  low-rank  matrices 

Next  we  show  that  sparse  and  low-rank  matrices  drawn  from  certain  natural  random 
ensembles  satisfy  the  sufficient  conditions  of  Corollary  3.4.1  with  high  probability.  We 
first  consider  random  sparse  matrices  with  a  fixed  number  of  nonzero  entries. 

Random sparsity model The matrix $A^*$ is such that support$(A^*)$ is chosen uniformly at
random from the collection of all support sets of size $m$. There is no assumption made
about the values of $A^*$ at locations specified by support$(A^*)$.

Lemma 3.4.1. Suppose that $A^* \in \mathbb{R}^{n \times n}$ is drawn according to the random sparsity
model with $m$ nonzero entries. Let $\deg_{\max}(A^*)$ be the maximum number of nonzero
entries in each row/column of $A^*$. We have that

$$\deg_{\max}(A^*) \leq \frac{m}{n} \log n,$$

with probability greater than $1 - O(n^{-\alpha})$ for $m = O(\alpha n)$.

The  proof  of  this  lemma  follows  from  a  standard  balls  and  bins  argument,  and  can 
be  found  in  several  references  (see  for  example  [19]). 

Next we consider low-rank matrices in which the singular vectors are chosen uniformly
at random from the set of all partial isometries. Such a model was considered in
recent work on the matrix completion problem [30], which aims to recover a low-rank
matrix given observations of a subset of entries of the matrix.

Random orthogonal model [30] A rank-$k$ matrix $B^* \in \mathbb{R}^{n \times n}$ with SVD $B^* = U \Sigma V^T$
is constructed as follows: The singular vectors $U, V \in \mathbb{R}^{n \times k}$ are drawn uniformly at
random from the collection of rank-$k$ partial isometries in $\mathbb{R}^{n \times k}$. The choices of $U$ and
$V$ need not be mutually independent. No restriction is placed on the singular values.

As shown in [30], low-rank matrices drawn from such a model have incoherent
row/column spaces.

Lemma 3.4.2. Suppose that a rank-$k$ matrix $B^* \in \mathbb{R}^{n \times n}$ is drawn according to the
random orthogonal model. Then we have that $\mathrm{inc}(B^*)$ (defined by (3.14)) is bounded
as

$$\mathrm{inc}(B^*) \lesssim \sqrt{\frac{\max(k, \log n)}{n}},$$

with probability greater than $1 - O(n^{-3} \log n)$.


Applying these two results in conjunction with Corollary 3.4.1, we have that sparse
and low-rank matrices drawn from the random sparsity model and the random orthogonal
model can be uniquely decomposed with high probability.


Corollary 3.4.2. Suppose that a rank-$k$ matrix $B^* \in \mathbb{R}^{n \times n}$ is drawn from the random
orthogonal model, and that $A^* \in \mathbb{R}^{n \times n}$ is drawn from the random sparsity model with
$m$ nonzero entries. Given $C = A^* + B^*$, there exists a range of values for $\gamma$ (given by
(3.15)) so that $(\hat{A}, \hat{B}) = (A^*, B^*)$ is the unique optimum of the SDP (3.3) with high
probability (given by the bounds in Lemma 3.4.1 and Lemma 3.4.2) provided

$$m \lesssim \frac{n^{1.5}}{\log n \sqrt{\max(k, \log n)}}.$$

In particular, $\gamma \sim \left( \frac{\max(k, \log n)}{m \log n} \right)^{1/3}$ guarantees exact recovery of $(A^*, B^*)$.

Thus, for matrices $B^*$ with rank $k$ smaller than $n$ the SDP (3.3) yields exact recovery
with high probability even when the size of the support of $A^*$ is super-linear in $n$.
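
The scalings in Corollary 3.4.2 follow from substituting the bounds of Lemmas 3.4.1 and 3.4.2 into Corollary 3.4.1. The derivation below is a heuristic sketch with constants suppressed; the identification of the stated $\gamma$ with the choice $p = 2/3$ is our own inference and is not stated explicitly in the text.

```latex
% Sufficient condition of Corollary 3.4.1, with the two lemma bounds inserted:
\[
\deg_{\max}(A^*)\,\mathrm{inc}(B^*)
  \;\lesssim\; \frac{m}{n}\log n \cdot \sqrt{\frac{\max(k,\log n)}{n}}
  \;=\; \frac{m \log n \sqrt{\max(k,\log n)}}{n^{1.5}} ,
\]
% which is below a fixed constant such as 1/12 whenever
\[
m \;\lesssim\; \frac{n^{1.5}}{\log n \sqrt{\max(k,\log n)}} .
\]
% Trade-off parameter: take p = 2/3 in \gamma = (6\,\mathrm{inc})^p / (2\deg_{\max})^{1-p}:
\[
\gamma \;\sim\; \frac{\mathrm{inc}(B^*)^{2/3}}{\deg_{\max}(A^*)^{1/3}}
  \;\sim\; \frac{\left(\max(k,\log n)/n\right)^{1/3}}{\left((m/n)\log n\right)^{1/3}}
  \;=\; \left(\frac{\max(k,\log n)}{m \log n}\right)^{1/3} .
\]
```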


Implications for the matrix rigidity problem Corollary 3.4.2 has implications for the matrix
rigidity problem discussed in Section 3.2. Recall that $R_M(k)$ is the smallest number
of entries of $M$ that need to be changed to reduce the rank of $M$ below $k$ (the
changes can be of arbitrary magnitude). A generic matrix $M \in \mathbb{R}^{n \times n}$ has rigidity
$R_M(k) = (n - k)^2$ [138]. However, special structured classes of matrices can have low
rigidity. Consider a matrix $M$ formed by adding a sparse matrix drawn from the random
sparsity model with support size $O(\frac{n}{\log n})$, and a low-rank matrix drawn from the
random orthogonal model with rank $\epsilon n$ for some fixed $\epsilon > 0$. Such a matrix has rigidity
$R_M(\epsilon n) = O(\frac{n}{\log n})$, and one can recover the sparse and low-rank components that
compose $M$ with high probability by solving the SDP (3.3). To see this, note that

$$\frac{n}{\log n} \lesssim \frac{n^{1.5}}{\log n \sqrt{\max(\epsilon n, \log n)}} = \frac{n^{1.5}}{\log n \sqrt{\epsilon n}},$$

which satisfies the sufficient condition of Corollary 3.4.2 for exact recovery. Therefore,
while the rigidity of a matrix is intractable to compute in general [38, 101], for such
low-rigidity matrices $M$ one can compute the rigidity $R_M(\epsilon n)$; in fact the SDP (3.3)
provides a certificate of the sparse and low-rank matrices that form the low rigidity
matrix $M$.

Figure 3.2. For each value of $m, k$, we generate $25 \times 25$ random $m$-sparse $A^*$ and random rank-$k$ $B^*$
and attempt to recover $(A^*, B^*)$ from $C = A^* + B^*$ using (3.3). For each value of $m, k$ we repeated
this procedure 10 times. The figure shows the probability of success in recovering $(A^*, B^*)$ using (3.3)
for various values of $m$ and $k$. White represents a probability of success of 1, while black represents a
probability of success of 0.


■  3.5  Simulation  Results 


We confirm the theoretical predictions in this chapter with some simple experimental
results. We also present a heuristic to choose the trade-off parameter $\gamma$. All our
simulations were performed using YALMIP [98] and the SDPT3 software [136] for solving
SDPs.

In the first experiment we generate random $25 \times 25$ matrices according to the random
sparsity and random orthogonal models described in Section 3.4.4. To generate
a random rank-$k$ matrix $B^*$ according to the random orthogonal model, we generate
$X, Y \in \mathbb{R}^{25 \times k}$ with i.i.d. Gaussian entries and set $B^* = XY^T$. To generate an $m$-sparse
matrix $A^*$ according to the random sparsity model, we choose a support set of size $m$
uniformly at random and the values within this support are i.i.d. Gaussian. The goal
is to recover $(A^*, B^*)$ from $C = A^* + B^*$ using the SDP (3.3). Let $\mathrm{tol}_\gamma$ be defined as:

$$\mathrm{tol}_\gamma = \frac{\|\hat{A} - A^*\|_F}{\|A^*\|_F} + \frac{\|\hat{B} - B^*\|_F}{\|B^*\|_F}, \tag{3.16}$$

where $(\hat{A}, \hat{B})$ is the solution of (3.3), and $\|\cdot\|_F$ is the Frobenius norm. We declare
success in recovering $(A^*, B^*)$ if $\mathrm{tol}_\gamma < 10^{-3}$. (We discuss the issue of choosing $\gamma$ in the
next experiment.) Figure 3.2 shows the success rate in recovering $(A^*, B^*)$ for various
values of $m$ and $k$ (averaged over 10 experiments for each $m, k$). Thus we see that one
can recover sufficiently sparse $A^*$ and sufficiently low-rank $B^*$ from $C = A^* + B^*$ using
(3.3).

Figure 3.3. Comparison between $\mathrm{tol}_t$ and $\mathrm{diff}_t$ for a randomly generated example with $n = 25$, $m = 25$,
$k = 2$.
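
The data generation and success criterion of the first experiment can be sketched as follows. This is an illustrative NumPy version and not the original simulation code; the function names and the use of `numpy.random.Generator` are assumptions.

```python
import numpy as np

def random_low_rank(n, k, rng):
    """B* = X Y^T with X, Y in R^{n x k} having i.i.d. Gaussian entries."""
    X = rng.standard_normal((n, k))
    Y = rng.standard_normal((n, k))
    return X @ Y.T

def random_sparse(n, m, rng):
    """A* with a uniformly random support of size m and i.i.d. Gaussian values."""
    A = np.zeros(n * n)
    support = rng.choice(n * n, size=m, replace=False)
    A[support] = rng.standard_normal(m)
    return A.reshape(n, n)

def tol_gamma(A_hat, B_hat, A_star, B_star):
    """Relative recovery error from (3.16)."""
    def fro(M):
        return np.linalg.norm(M, "fro")
    return fro(A_hat - A_star) / fro(A_star) + fro(B_hat - B_star) / fro(B_star)
```

A trial would then count as a success when `tol_gamma(...)` falls below `1e-3` for the solution returned by whichever solver is used for (3.3).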

Next we consider the problem of choosing the trade-off parameter $\gamma$. Based on
Theorem 3.4.1 we know that exact recovery is possible for a range of $\gamma$. Therefore, one
can simply check the stability of the solution $(\hat{A}, \hat{B})$ as $\gamma$ is varied without knowing the
appropriate range for $\gamma$ in advance. To formalize this scheme we consider the following
SDP for $t \in [0, 1]$, which is a slightly modified version of (3.3):

$$(\hat{A}_t, \hat{B}_t) = \arg\min_{A, B} \; t\|A\|_1 + (1 - t)\|B\|_* \quad \text{s.t.} \quad A + B = C. \tag{3.17}$$

There is a one-to-one correspondence between (3.3) and (3.17) given by $t = \frac{\gamma}{1 + \gamma}$. The
benefit in looking at (3.17) is that the range of valid parameters is compact, i.e., $t \in [0, 1]$,
as opposed to the situation in (3.3) where $\gamma \in [0, \infty)$. We compute the difference
between solutions for some $t$ and $t - \epsilon$ as follows:

$$\mathrm{diff}_t = \left( \|\hat{A}_{t-\epsilon} - \hat{A}_t\|_F \right) + \left( \|\hat{B}_{t-\epsilon} - \hat{B}_t\|_F \right), \tag{3.18}$$

where $\epsilon > 0$ is some small fixed constant, say $\epsilon = 0.01$. We generate a random
$A^* \in \mathbb{R}^{25 \times 25}$ that is 25-sparse and a random $B^* \in \mathbb{R}^{25 \times 25}$ with rank 2 as described above.
Given $C = A^* + B^*$, we solve (3.17) for various values of $t$. Figure 3.3 shows two
curves: one is $\mathrm{tol}_t$ (which is defined analogously to $\mathrm{tol}_\gamma$ in (3.16)) and the other is
$\mathrm{diff}_t$. Clearly we do not have access to $\mathrm{tol}_t$ in practice. However, we see that $\mathrm{diff}_t$
is near zero in exactly three regions. For sufficiently small $t$ the optimal solution to
(3.17) is $(\hat{A}_t, \hat{B}_t) = (A^* + B^*, 0)$, while for sufficiently large $t$ the optimal solution is
$(\hat{A}_t, \hat{B}_t) = (0, A^* + B^*)$. As seen in the figure, $\mathrm{diff}_t$ stabilizes for small and large $t$. The
third “middle” range of stability is where we typically have $(\hat{A}_t, \hat{B}_t) = (A^*, B^*)$. Notice
that outside of these three regions $\mathrm{diff}_t$ is not close to 0 and in fact changes rapidly.
Therefore if a reasonable guess for $t$ (or $\gamma$) is not available, one could solve (3.17) for
a range of $t$ and choose a solution corresponding to the “middle” range in which $\mathrm{diff}_t$
is stable and near zero. A related method to check for stability is to compute the
sensitivity of the cost of the optimal solution with respect to $\gamma$, which can be obtained
from the dual solution.
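
The stability heuristic can be prototyped without an SDP solver. The sketch below replaces the YALMIP/SDPT3 setup used for the experiments in this chapter with a basic ADMM iteration for (3.17), a first-order method that returns only approximate solutions; the penalty `rho`, the fixed iteration count, and all function names are assumptions chosen for illustration rather than tuned values.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entrywise shrinkage: proximal operator of tau * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def singular_value_threshold(M, tau):
    """Shrink singular values by tau: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def solve_3_17(C, t, rho=1.0, iters=500):
    """Approximate minimizer of t*||A||_1 + (1-t)*||B||_* subject to A + B = C."""
    A = np.zeros_like(C, dtype=float)
    B = np.zeros_like(C, dtype=float)
    Y = np.zeros_like(C, dtype=float)        # multiplier for the constraint A + B = C
    for _ in range(iters):
        B = singular_value_threshold(C - A + Y / rho, (1.0 - t) / rho)
        A = soft_threshold(C - B + Y / rho, t / rho)
        Y = Y + rho * (C - A - B)
    return A, B

def diff_curve(C, ts, eps=0.01):
    """diff_t from (3.18) evaluated at each t in `ts`."""
    out = []
    for t in ts:
        A_prev, B_prev = solve_3_17(C, t - eps)
        A_cur, B_cur = solve_3_17(C, t)
        out.append(np.linalg.norm(A_prev - A_cur, "fro")
                   + np.linalg.norm(B_prev - B_cur, "fro"))
    return np.array(out)
```

Evaluating `diff_curve` on a grid such as `np.arange(0.05, 1.0, 0.01)` and looking for the interior interval where the values stay near zero mirrors the procedure described above. Since the iterates are only approximately optimal, "near zero" has to be judged relative to the solver accuracy.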

■  3.6  Discussion 

We have studied the problem of exactly decomposing a given matrix $C = A^* + B^*$ into
its sparse and low-rank components $A^*$ and $B^*$. This problem arises in a number of
applications  in  model  selection,  system  identification,  complexity  theory,  and  optics. 
We  characterized  fundamental  identifiability  in  the  decomposition  problem  based  on 
a  notion  of  rank-sparsity  incoherence,  which  relates  the  sparsity  pattern  of  a  matrix 
and  its  row/column  spaces  via  an  uncertainty  principle.  As  the  general  decomposition 
problem  is  intractable  to  solve,  we  propose  a  natural  SDP  relaxation  (3.3)  to  solve 
the  problem,  and  provide  sufficient  conditions  on  sparse  and  low-rank  matrices  so  that 
the  SDP  exactly  recovers  such  matrices.  Our  sufficient  conditions  are  deterministic  in 
nature;  they  essentially  require  that  the  sparse  matrix  must  have  support  that  is  not 
too  concentrated  in  any  row/column,  while  the  low-rank  matrix  must  have  row/column 
spaces  that  are  not  closely  aligned  with  the  coordinate  axes.  Our  analysis  centers 
around  studying  the  tangent  spaces  with  respect  to  the  algebraic  varieties  of  sparse 
and  low-rank  matrices.  Indeed  the  sufficient  conditions  for  identifiability  and  for  exact 
recovery  using  the  SDP  can  also  be  viewed  as  requiring  that  certain  tangent  spaces  have 
a  transverse  intersection.  The  implications  of  our  results  for  the  matrix  rigidity  problem 
are  also  demonstrated.  An  interesting  problem  for  further  research  is  the  development 
of special-purpose algorithms that take advantage of structure in (3.3) to provide a
more efficient solution than a general-purpose SDP solver.

Chapter 4


Latent  Variable  Graphical  Model 
Selection  via  Convex  Optimization 


■  4.1  Introduction 

Statistical model selection in the high-dimensional regime arises in a number of
applications. In many data analysis problems in geophysics, radiology, genetics, climate
studies, and image processing, the number of samples available is comparable to or even
smaller than the number of variables. However, it is well-known that empirical statistics
such as sample covariance matrices are not well-behaved when both the number of
samples and the number of variables are large and comparable to each other (see [103]).
Model selection in such a setting is therefore both challenging and of great interest. In
order for model selection to be well-posed given limited information, a key assumption
that is often made is that the underlying model to be estimated only has a few degrees
of freedom. Common assumptions are that the data are generated according to a
graphical model, or a stationary time-series model, or a simple factor model with a few
latent variables. Sometimes geometric assumptions are also made in which the data are
viewed as samples drawn according to a distribution supported on a low-dimensional
manifold.

A model selection problem that has received considerable attention recently is the
estimation of covariance matrices in the high-dimensional setting. As the sample
covariance matrix is poorly behaved in such a regime [85, 103], some form of regularization
of the sample covariance is adopted based on assumptions about the true underlying
covariance matrix. For example approaches based on banding the sample covariance
matrix [15] have been proposed for problems in which the variables have a natural
ordering (e.g., time series), while “permutation-invariant” methods that use thresholding
are useful when there is no natural variable ordering [16, 61]. These approaches provide
consistency guarantees under various sparsity assumptions on the true covariance
matrix. Other techniques that have been studied include methods based on
shrinkage [94, 145] and factor analysis [62]. A number of papers have studied covariance
estimation in the context of Gaussian graphical model selection. In a Gaussian graphical
model the inverse of the covariance matrix, also called the concentration matrix, is
assumed to be sparse, and the sparsity pattern reveals the conditional independence
relations satisfied by the variables. The model selection method usually studied in such a
setting is $\ell_1$-regularized maximum-likelihood, with the $\ell_1$ penalty applied to the entries
of the inverse covariance matrix to induce sparsity. The consistency properties of such
an estimator have been studied [92, 119, 126], and under suitable conditions [92, 119] this
estimator is also “sparsistent”, i.e., the estimated concentration matrix has the same
sparsity pattern as the true model from which the samples are generated. An alternative
approach to $\ell_1$-regularized maximum-likelihood is to estimate the sparsity pattern
of the concentration matrix by performing regression separately on each variable [107];
while such a method consistently estimates the sparsity pattern, it does not directly
provide estimates of the covariance or concentration matrix.

In  many  applications  throughout  science  and  engineering,  a  challenge  is  that  one 
may  not  have  access  to  observations  of  all  the  relevant  phenomena,  i.e.,  some  of  the 
relevant  variables  may  be  hidden  or  unobserved.  Such  a  scenario  arises  in  data  analysis 
tasks  in  psychology,  computational  biology,  and  economics.  In  general  latent  variables 
pose  a  significant  difficulty  for  model  selection  because  one  may  not  know  the  number  of 
relevant  latent  variables,  nor  the  relationship  between  these  variables  and  the  observed 
variables.  Typical  algorithmic  methods  that  try  to  get  around  this  difficulty  usually 
fix the number of latent variables as well as some structural relationship between
latent  and  observed  variables  (e.g.,  the  graphical  model  structure  between  latent  and 
observed  variables),  and  use  the  EM  algorithm  to  fit  parameters  [44].  This  approach 
suffers  from  the  problem  that  one  optimizes  non-convex  functions,  and  thus  one  may  get 
stuck  in  sub-optimal  local  minima.  An  alternative  method  that  has  been  suggested  is 
based  on  a  greedy,  local,  combinatorial  heuristic  that  assigns  latent  variables  to  groups 
of  observed  variables,  based  on  some  form  of  clustering  of  the  observed  variables  [60]; 
however,  this  approach  has  no  consistency  guarantees. 

In this chapter we study the problem of latent-variable graphical model selection in
the setting where all the variables, both observed and hidden, are jointly Gaussian.
More concretely, let the covariance matrix of a finite collection of jointly Gaussian
random variables $X_O \cup X_H$ be denoted by $\Sigma_{(O\,H)}$, where $X_O$ are the observed variables
and $X_H$ are the unobserved, hidden variables. The marginal statistics corresponding
to the observed variables $X_O$ are given by the marginal covariance matrix $\Sigma_O$, which
is simply a submatrix of the full covariance matrix $\Sigma_{(O\,H)}$. However, suppose that
we parameterize our model by the concentration matrix $K_{(O\,H)} = \Sigma_{(O\,H)}^{-1}$, which as
discussed above reveals the connection to graphical models. In such a parametrization,
the marginal concentration matrix $\Sigma_O^{-1}$ corresponding to the observed variables $X_O$ is
given by the Schur complement [82] with respect to the block $K_H$:

$$\tilde{K}_O = \Sigma_O^{-1} = K_O - K_{O,H} K_H^{-1} K_{H,O}.$$

Thus if we only observe the variables $X_O$, we only have access to $\Sigma_O$ (or $\tilde{K}_O$). The two
terms that compose $\tilde{K}_O$ above have interesting properties. The matrix $K_O$ specifies the
concentration matrix of the conditional statistics of the observed variables given the
latent variables. If these conditional statistics are given by a sparse graphical model
then $K_O$ is sparse. On the other hand the matrix $K_{O,H} K_H^{-1} K_{H,O}$ serves as a summary
of the effect of marginalization over the hidden variables $H$. This matrix has small
rank if the number of latent, unobserved variables $H$ is small relative to the number of
observed variables $O$ (the rank is at most $|H|$, with equality when $K_{O,H}$ has full column
rank). Therefore the marginal concentration matrix $\tilde{K}_O$ of the observed variables $X_O$ is
generally not sparse due to the additional low-rank term $K_{O,H} K_H^{-1} K_{H,O}$. Hence standard
graphical model selection techniques applied directly to the observed variables $X_O$ are
not useful.
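
The Schur-complement relation above is easy to check numerically. The following is a minimal sketch in Python with NumPy on a hypothetical toy model (three observed variables and one hidden variable; the numerical values are illustrative assumptions, not taken from this chapter). It checks that inverting the marginal covariance reproduces $K_O - K_{O,H} K_H^{-1} K_{H,O}$, and that the marginal concentration matrix is dense even though $K_O$ is sparse.

```python
import numpy as np

# Hypothetical toy model: 3 observed variables, 1 hidden variable.
K_O = np.array([[2.0, 0.5, 0.0],
                [0.5, 2.0, 0.0],
                [0.0, 0.0, 2.0]])        # sparse conditional concentration
K_OH = np.array([[0.5], [0.5], [0.5]])   # coupling between observed and hidden
K_H = np.array([[2.0]])                  # hidden block

# Joint concentration matrix and joint covariance.
K_full = np.block([[K_O, K_OH], [K_OH.T, K_H]])
Sigma_full = np.linalg.inv(K_full)
Sigma_O = Sigma_full[:3, :3]             # marginal covariance of the observed block

# Low-rank summary of marginalization and the Schur complement.
low_rank = K_OH @ np.linalg.inv(K_H) @ K_OH.T
K_tilde_O = K_O - low_rank

schur_matches = np.allclose(np.linalg.inv(Sigma_O), K_tilde_O)
rank_low = np.linalg.matrix_rank(low_rank)
num_zeros_cond = int(np.sum(K_O == 0))
num_zeros_marg = int(np.sum(np.isclose(K_tilde_O, 0)))
```

In this toy model the low-rank term has rank one, matching the single hidden variable, while every zero entry of $K_O$ becomes nonzero in $\tilde{K}_O$ after marginalization.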

A modeling paradigm that infers the effect of the latent variables $X_H$ would be more
suitable in order to provide a simple explanation of the underlying statistical structure.
Hence we decompose $\tilde{K}_O$ into the sparse and low-rank components, which reveals the
conditional  graphical  model  structure  in  the  observed  variables  as  well  as  the  number 
of  and  effect  due  to  the  unobserved  latent  variables.  Such  a  method  can  be  viewed  as 
a  blend  of  principal  component  analysis  and  graphical  modeling.  In  standard  graphical 
modeling  one  would  directly  approximate  a  concentration  matrix  by  a  sparse  matrix 
in  order  to  learn  a  sparse  graphical  model.  On  the  other  hand  in  principal  component 
analysis  the  goal  is  to  explain  the  statistical  structure  underlying  a  set  of  observations 
using  a  small  number  of  latent  variables  (i.e.,  approximate  a  covariance  matrix  as  a 
low-rank  matrix).  In  our  framework  based  on  decomposing  a  concentration  matrix,  we 
learn a graphical model among the observed variables conditioned on a few (additional)
latent variables. Notice that in our setting these latent variables are not principal
components,  as  the  conditional  statistics  (conditioned  on  these  latent  variables)  are 
given  by  a  graphical  model.  Therefore  we  refer  to  these  latent  variables  informally  as 
hidden  components. 

Our first contribution in Section 4.3 is to address the fundamental question of identifiability
of such latent-variable graphical models given the marginal statistics of only
the observed variables. The critical point is that we need to tease apart the correlations
induced due to marginalization over the latent variables from the conditional graphical
model structure among the observed variables. As the identifiability problem is one
of uniquely decomposing the sum of a sparse matrix and a low-rank matrix into the
individual components, we recall the conditions derived in Chapter 3 that relate unique
identifiability to properties of the tangent spaces to the algebraic varieties of sparse and
low-rank matrices. Specifically let $\Omega(K_O)$ denote the tangent space at $K_O$ to the algebraic
variety of sparse matrices, and let $T(K_{O,H} K_H^{-1} K_{H,O})$ denote the tangent space
at $K_{O,H} K_H^{-1} K_{H,O}$ to the algebraic variety of low-rank matrices. Then the statistical
question of identifiability of $K_O$ and $K_{O,H} K_H^{-1} K_{H,O}$ given $\tilde{K}_O$ is determined by the
geometric notion of transversality of the tangent spaces $\Omega(K_O)$ and $T(K_{O,H} K_H^{-1} K_{H,O})$.
The study of the transversality of these tangent spaces leads us to natural conditions
for identifiability. In particular we show that latent-variable models in which (1) the
sparse matrix $K_O$ has a small number of nonzeros per row/column, and (2) the low-rank
matrix $K_{O,H} K_H^{-1} K_{H,O}$ has row/column spaces that are not closely aligned with
the coordinate axes, are identifiable. These two conditions have natural statistical
interpretations. The first condition ensures that there are no densely-connected subgraphs
in the conditional graphical model structure among the observed variables $X_O$ given
the hidden components, i.e., that these conditional statistics are indeed specified by a
sparse graphical model. Such statistical relationships may otherwise be mistakenly
attributed to the effect of marginalization over some latent variable. The second condition
ensures that the effect of marginalization over the latent variables is “spread out” over
many observed variables; thus, the effect of marginalization over a latent variable is not
confused with the conditional graphical model structure among the observed variables.
In fact the first condition is often assumed in papers on standard graphical model
selection without latent variables (see for example [119]). We note here that the question
of parameter identifiability was recently studied for models with discrete-valued latent
variables (i.e., mixture models, hidden Markov models) [2]. However, this work is not
applicable to our setting in which both the latent and observed variables are assumed
to be jointly Gaussian.

As our next contribution we propose a regularized maximum-likelihood decomposition
framework to approximate a given sample covariance matrix by a model in which
the concentration matrix decomposes into a sparse matrix and a low-rank matrix.
Motivated by the combined $\ell_1$ norm and nuclear norm heuristic proposed in Chapter 3 for
sparse/low-rank matrix decomposition, we propose the following penalized likelihood
method given a sample covariance matrix $\Sigma_O^n$ formed from $n$ samples of the observed
variables:


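$$(\hat{S}_n, \hat{L}_n) = \arg\min_{S, L} \; -\ell(S - L; \Sigma_O^n) + \lambda_n \left( \gamma \|S\|_1 + \mathrm{Tr}(L) \right) \quad \text{s.t.} \quad S - L \succ 0, \; L \succeq 0. \tag{4.1}$$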


Here $\ell$ represents the Gaussian log-likelihood function and is given by $\ell(K; \Sigma) =
\log\det(K) - \mathrm{Tr}(K\Sigma)$ for $K \succ 0$, where $\mathrm{Tr}$ is the trace of a matrix and $\det$ is the
determinant. The matrix $\hat{S}_n$ provides an estimate of $K_O$, which represents the conditional
concentration matrix of the observed variables; the matrix $\hat{L}_n$ provides an estimate of
$K_{O,H} K_H^{-1} K_{H,O}$, which represents the effect of marginalization over the latent variables.
Notice that the regularization function is a combination of the $\ell_1$ norm applied to $S$ and
the nuclear norm applied to $L$ (the nuclear norm reduces to the trace over the cone of
symmetric, positive-semidefinite matrices), with $\gamma$ providing a tradeoff between the two
terms. This variational formulation is a convex optimization problem. In particular it
is a regularized max-det problem and can be solved in polynomial time using standard
off-the-shelf solvers.
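
To make the objective in (4.1) concrete, the sketch below evaluates it for a candidate pair $(S, L)$. It is only an evaluator under the definitions stated above, not a solver; the function name `penalized_objective` and the feasibility tolerance `tol` are hypothetical choices made for illustration.

```python
import numpy as np

def penalized_objective(S, L, Sigma_n, lam, gamma, tol=1e-10):
    """Evaluate -l(S - L; Sigma_n) + lam * (gamma * ||S||_1 + Tr(L)).

    Returns +inf when the constraints (S - L positive definite,
    L positive semidefinite) are violated.
    """
    K = S - L
    if np.linalg.eigvalsh(K).min() <= 0 or np.linalg.eigvalsh(L).min() < -tol:
        return np.inf
    _, logdet = np.linalg.slogdet(K)
    log_lik = logdet - np.trace(K @ Sigma_n)           # Gaussian log-likelihood
    penalty = lam * (gamma * np.abs(S).sum() + np.trace(L))
    return -log_lik + penalty

# Feasible example: S = 2I, L = 0, sample covariance I.
value = penalized_objective(2.0 * np.eye(2), np.zeros((2, 2)), np.eye(2),
                            lam=0.1, gamma=1.0)

# Infeasible example: S - L = -I is not positive definite.
infeasible = penalized_objective(np.eye(2), 2.0 * np.eye(2), np.eye(2),
                                 lam=0.1, gamma=1.0)
```

Because the trace of $L$ stands in for the nuclear norm only on the positive-semidefinite cone, the evaluator treats violations of $L \succeq 0$ as infeasible rather than computing singular values.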

Our main result in Section 4.4 is a proof of the consistency of the estimator (4.1) in
the high-dimensional regime in which both the number of observed variables and the
number of hidden components are allowed to grow with the number of samples (of the
observed variables). We show that for a suitable choice of the regularization parameter
$\lambda_n$, there exists a range of values of $\gamma$ for which the estimates $(\hat{S}_n, \hat{L}_n)$ have the same
sparsity (and sign) pattern and rank as $(K_O^*, K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$ with high probability
(see Theorem 4.4.1). The key technical requirement is an identifiability condition for
the two components of the marginal concentration matrix $\tilde{K}_O$ with respect to the
Fisher information (see Section 4.3.2). We make connections between our condition and
the irrepresentability conditions required for support/graphical-model recovery using
$\ell_1$ regularization [119,148]. Our results provide numerous scaling regimes under which
consistency holds in latent-variable graphical model selection. For example we show
that under suitable identifiability conditions consistent model selection is possible even
when the number of samples and the number of latent variables are on the same order
as the number of observed variables (see Section 4.4.3).

Related  previous  work  The  problem  of  decomposing  the  sum  of  a  sparse  matrix  and  a 
low-rank  matrix,  with  no  additional  noise,  into  the  individual  components  was  initially 
studied  in  [37];  the  results  of  that  paper  are  described  in  Chapter  3.  In  subsequent 
work  Candes  et  al.  [31]  also  studied  this  noise-free  sparse-plus-low-rank  decomposition 
problem,  and  provided  guarantees  for  exact  recovery  using  the  convex  program  proposed 
in  [37].  The  problem  setup  considered  in  the  present  chapter  is  quite  different  and  is 
more  challenging  because  we  are  only  given  access  to  an  inexact  sample  covariance 
matrix,  and  we  are  interested  in  recovering  components  that  preserve  both  the  sparsity 
pattern  and  the  rank  of  the  components  in  the  true  underlying  model.  In  addition  to 
proving  such  a  consistency  result  for  the  estimator  (4.1),  we  also  provide  a  statistical 
interpretation  of  our  identifiability  conditions  and  describe  natural  classes  of  latent- 
variable  Gaussian  graphical  models  that  satisfy  these  conditions.  As  such  our  work  is 
closer  in  spirit  to  the  many  recent  papers  on  covariance  selection,  but  with  the  important 
difference  that  some  of  the  variables  are  not  observed. 

Outline Section 4.2 gives some background on graphical models as well as the algebraic
varieties of sparse and low-rank matrices. It also provides a formal statement
of the problem. Section 4.3 discusses conditions under which latent-variable models
are identifiable, and Section 4.4 states the main results of this chapter. We provide
experimental demonstration of the consistency of our estimator on synthetic data in
Section 4.5. Section 4.6 concludes the chapter with a brief discussion. Appendix B
includes additional details and proofs of all of our technical results.

■  4.2  Background  and  Problem  Statement 

We  briefly  discuss  concepts  from  graphical  modeling  and  give  a  formal  statement  of 
the  latent-variable  model  selection  problem.  We  also  describe  various  properties  of  the 
algebraic  varieties  of  sparse  matrices  and  of  low-rank  matrices.  Although  some  of  these 
have  been  introduced  previously,  we  emphasize  again  that  the  following  matrix  norms 
are  employed  throughout  this  chapter: 

• $\|M\|_2$: denotes the spectral norm, which is the largest singular value of $M$.

• $\|M\|_\infty$: denotes the largest entry in magnitude of $M$.

• $\|M\|_F$: denotes the Frobenius norm, which is the square root of the sum of the
squares of the entries of $M$.

• $\|M\|_*$: denotes the nuclear norm, which is the sum of the singular values of $M$.
This reduces to the trace for positive-semidefinite matrices.

• $\|M\|_1$: denotes the sum of the absolute values of the entries of $M$.
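
As a quick reference, the five norms listed above can be computed with NumPy as sketched below. The matrix `M` is an arbitrary illustrative example. Note that the entrywise norms $\|M\|_\infty$ and $\|M\|_1$ used in this chapter differ from the induced norms that `np.linalg.norm(M, np.inf)` and `np.linalg.norm(M, 1)` return, so they are computed directly from the entries.

```python
import numpy as np

M = np.array([[3.0, 0.0],
              [0.0, -4.0]])

spectral = np.linalg.norm(M, 2)        # largest singular value
max_entry = np.abs(M).max()            # largest entry in magnitude (entrywise)
frobenius = np.linalg.norm(M, 'fro')   # sqrt of the sum of squared entries
nuclear = np.linalg.norm(M, 'nuc')     # sum of the singular values
entrywise_l1 = np.abs(M).sum()         # sum of absolute values of the entries

# The induced 1-norm (maximum absolute column sum) is a different quantity.
induced_l1 = np.linalg.norm(M, 1)
```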


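A number of matrix operator norms are also used. For example, let $\mathcal{Z} : \mathbb{R}^{p \times p} \to \mathbb{R}^{p \times p}$
be a linear operator acting on matrices. Then the induced operator norm $\|\mathcal{Z}\|_{q \to q}$ is
defined as:

$$\|\mathcal{Z}\|_{q \to q} \triangleq \max_{N \in \mathbb{R}^{p \times p}, \; \|N\|_q \le 1} \|\mathcal{Z}(N)\|_q. \tag{4.2}$$

Therefore, $\|\mathcal{Z}\|_{F \to F}$ denotes the spectral norm of the matrix operator $\mathcal{Z}$. The only
vector norm used is the Euclidean norm, which is denoted by $\|\cdot\|$.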


■  4.2.1  Gaussian  graphical  models  with  latent  variables 

A graphical model [93] is a statistical model defined with respect to a graph $(V, \mathcal{E})$ in
which the nodes index a collection of random variables $\{X_v\}_{v \in V}$, and the edges represent
the conditional independence relations (Markov structure) among the variables.
The absence of an edge between nodes $i, j \in V$ implies that the variables $X_i, X_j$ are
independent conditioned on all the other variables. A Gaussian graphical model (also
commonly referred to as a Gauss-Markov random field) is one in which all the variables
are jointly Gaussian [132]. In such models the sparsity pattern of the inverse of the
covariance matrix, or the concentration matrix, directly corresponds to the graphical
model structure. Specifically, consider a Gaussian graphical model in which the covariance
matrix is given by $\Sigma \succ 0$ and the concentration matrix is given by $K = \Sigma^{-1}$. Then
an edge $\{i, j\} \in \mathcal{E}$ is present in the underlying graphical model if and only if $K_{i,j} \neq 0$.
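
The correspondence between edges and nonzero entries of the concentration matrix can be illustrated with a short sketch. The helper `edges_from_concentration` and the three-node chain below are hypothetical, chosen only to show that two variables can be correlated marginally and yet have no edge between them.

```python
import numpy as np

def edges_from_concentration(K, tol=1e-10):
    """Return the edges {i, j}, as sorted tuples, for which K[i, j] is nonzero."""
    p = K.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(K[i, j]) > tol}

# Concentration matrix of a chain 0 - 1 - 2 (illustrative values).
K = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])

edges = edges_from_concentration(K)
Sigma = np.linalg.inv(K)   # covariance; dense although K is sparse
```

Here $K_{0,2} = 0$, so variables 0 and 2 are independent conditioned on variable 1, while the covariance entry $\Sigma_{0,2}$ is nonzero.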

Our focus in this chapter is on Gaussian models in which some of the variables
may not be observed. Suppose $O$ represents the set of nodes corresponding to observed
variables $X_O$, and $H$ the set of nodes corresponding to unobserved, hidden variables $X_H$,
with $O \cup H = V$ and $O \cap H = \emptyset$. The joint covariance is denoted by $\Sigma_{(O\,H)}$ and the joint
concentration matrix by $K_{(O\,H)} = \Sigma_{(O\,H)}^{-1}$. The submatrix $\Sigma_O$ represents the marginal
covariance of the observed variables $X_O$, and the corresponding marginal concentration
matrix is given by the Schur complement with respect to the block $K_H$:

$$\tilde{K}_O = \Sigma_O^{-1} = K_O - K_{O,H} K_H^{-1} K_{H,O}. \tag{4.3}$$

The submatrix $K_O$ specifies the concentration matrix of the conditional statistics of
the observed variables conditioned on the hidden components. If these conditional
statistics are given by a sparse graphical model then $K_O$ is sparse. On the other hand
the marginal concentration matrix $\tilde{K}_O$ of the marginal distribution of $X_O$ is not sparse
in general due to the extra correlations induced from marginalization over the latent
variables $X_H$, i.e., due to the presence of the additional term $K_{O,H} K_H^{-1} K_{H,O}$. Hence,
standard graphical model selection techniques in which the goal is to approximate a
sample covariance by a sparse graphical model are not well-suited for problems in which
some of the variables are hidden. However, the matrix $K_{O,H} K_H^{-1} K_{H,O}$ is a low-rank
matrix if the number of hidden variables is much smaller than the number of observed
variables (i.e., $|H| \ll |O|$). Therefore, a more appropriate model selection method is
to approximate the sample covariance by a model in which the concentration matrix
decomposes into the sum of a sparse matrix and a low-rank matrix. The objective here
is to learn a sparse graphical model among the observed variables conditioned on some
latent variables, as such a model explicitly accounts for the extra correlations induced
due to unobserved, hidden components.

■  4.2.2  Problem  statement 

In order to analyze latent-variable model selection methods, we need to define an
appropriate notion of model selection consistency for latent-variable graphical models. Notice
that given the two components $K_O$ and $K_{O,H} K_H^{-1} K_{H,O}$ of the concentration matrix of
the marginal distribution (4.3), there are infinitely many configurations of the latent
variables (i.e., matrices $K_H \succ 0$, $K_{O,H} = K_{H,O}^T$) that give rise to the same low-rank
matrix $K_{O,H} K_H^{-1} K_{H,O}$. Specifically for any non-singular matrix $B \in \mathbb{R}^{|H| \times |H|}$, one can
apply the transformations $K_H \to B K_H B^T$, $K_{O,H} \to K_{O,H} B^T$ and still preserve the
low-rank matrix $K_{O,H} K_H^{-1} K_{H,O}$. In all of these models the marginal statistics of the
observed variables $X_O$ remain the same upon marginalization over the latent variables
$X_H$. The key invariant is the low-rank matrix $K_{O,H} K_H^{-1} K_{H,O}$, which summarizes the
effect of marginalization over the latent variables. These observations give rise to the
following notion of consistency:
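
The invariance of the low-rank term under a change of basis of the hidden variables can be verified numerically. The sketch below uses hypothetical values for $K_{O,H}$, $K_H$ and the non-singular matrix $B$; only the algebraic identity $K_{O,H} B^T (B K_H B^T)^{-1} B K_{H,O} = K_{O,H} K_H^{-1} K_{H,O}$ is being illustrated.

```python
import numpy as np

# Hypothetical blocks with |O| = 3 observed and |H| = 2 hidden variables.
K_OH = np.array([[0.5,  0.0],
                 [0.5,  0.2],
                 [0.5, -0.3]])
K_H = np.array([[2.0, 0.3],
                [0.3, 1.5]])             # positive definite
B = np.array([[1.0, 2.0],
              [0.0, 3.0]])               # non-singular (determinant 3)

low_rank_before = K_OH @ np.linalg.inv(K_H) @ K_OH.T

# Transformed latent configuration.
K_H_new = B @ K_H @ B.T
K_OH_new = K_OH @ B.T
low_rank_after = K_OH_new @ np.linalg.inv(K_H_new) @ K_OH_new.T

invariant = np.allclose(low_rank_before, low_rank_after)
latent_changed = not np.allclose(K_H, K_H_new)
```

The hidden block changes, yet the summary seen by the observed variables does not, which is why consistency is defined in terms of the low-rank matrix rather than the individual latent parameters.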

Definition 4.2.1. A pair of (symmetric) matrices $(\hat{S}, \hat{L})$ with $\hat{S}, \hat{L} \in \mathbb{R}^{|O| \times |O|}$ is an
algebraically consistent estimate of a latent-variable Gaussian graphical model given by
the concentration matrix $K_{(O\,H)}$ if the following conditions hold:

1. The sign-pattern of $\hat{S}$ is the same as that of $K_O$:

$$\mathrm{sign}(\hat{S}_{i,j}) = \mathrm{sign}((K_O)_{i,j}), \quad \forall (i, j).$$

Here we assume that $\mathrm{sign}(0) = 0$.

2. The rank of $\hat{L}$ is the same as the rank of $K_{O,H} K_H^{-1} K_{H,O}$:

$$\mathrm{rank}(\hat{L}) = \mathrm{rank}(K_{O,H} K_H^{-1} K_{H,O}).$$

3. The concentration matrix $\hat{S} - \hat{L}$ can be realized as the marginal concentration
matrix of an appropriate latent-variable model:

$$\hat{S} - \hat{L} \succ 0, \quad \hat{L} \succeq 0.$$

The first condition ensures that $\hat{S}$ provides the correct structural estimate of the
conditional graphical model (given by $K_O$) of the observed variables conditioned on the
hidden components. This property is the same as the “sparsistency” property studied
in standard graphical model selection [92,119]. The second condition ensures that
the number of hidden components is correctly estimated. Finally, the third condition
ensures that the pair of matrices $(\hat{S}, \hat{L})$ leads to a realizable latent-variable model. In
particular this condition implies that there exists a valid latent-variable model on $|O \cup H|$
variables in which (a) the conditional graphical model structure among the observed
variables is given by $\hat{S}$, (b) the number of latent variables $|H|$ is equal to the rank of $\hat{L}$,
and (c) the extra correlations induced due to marginalization over the latent variables
are equal to $\hat{L}$. Any method for matrix factorization (see, e.g., [143]) can be used to
factorize the low-rank matrix $\hat{L}$, depending on the properties that one desires in the
factors (e.g., sparsity).
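
The three conditions of Definition 4.2.1 translate directly into a numerical check. The following is a minimal sketch; the function name `is_algebraically_consistent` and the tolerance `tol` used to decide which entries and singular values count as zero are assumptions introduced here, not part of the definition.

```python
import numpy as np

def is_algebraically_consistent(S, L, K_O, low_rank_true, tol=1e-8):
    """Check the sign-pattern, rank, and realizability conditions."""
    # Condition 1: same sign pattern, with entries below tol treated as zero.
    S_clean = np.where(np.abs(S) > tol, S, 0.0)
    K_clean = np.where(np.abs(K_O) > tol, K_O, 0.0)
    same_signs = np.array_equal(np.sign(S_clean), np.sign(K_clean))
    # Condition 2: same rank for the low-rank components.
    same_rank = (np.linalg.matrix_rank(L, tol=tol)
                 == np.linalg.matrix_rank(low_rank_true, tol=tol))
    # Condition 3: S - L positive definite and L positive semidefinite.
    realizable = (np.linalg.eigvalsh(S - L).min() > 0
                  and np.linalg.eigvalsh(L).min() >= -tol)
    return bool(same_signs and same_rank and realizable)

# Hypothetical true model and two candidate estimates.
K_O = np.array([[2.0, 0.5, 0.0],
                [0.5, 2.0, 0.0],
                [0.0, 0.0, 2.0]])
low_rank_true = 0.125 * np.ones((3, 3))

S_good, L_good = 1.01 * K_O, 0.9 * low_rank_true
S_bad = S_good.copy()
S_bad[0, 2] = S_bad[2, 0] = 0.01        # spurious edge between 0 and 2

good = is_algebraically_consistent(S_good, L_good, K_O, low_rank_true)
bad = is_algebraically_consistent(S_bad, L_good, K_O, low_rank_true)
```

The second candidate is numerically close to the first, yet it fails the check because of a single spurious nonzero entry.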

We also study parametric consistency in the usual sense, i.e., we show that one can
produce estimates $(\hat{S}, \hat{L})$ that converge in various norms to the matrices $(K_O, K_{O,H} K_H^{-1} K_{H,O})$
as the number of samples available goes to infinity. Notice that proving $(\hat{S}, \hat{L})$ is close
to $(K_O, K_{O,H} K_H^{-1} K_{H,O})$ in some norm does not in general imply that the support/sign-pattern
and rank of $(\hat{S}, \hat{L})$ are the same as those of $(K_O, K_{O,H} K_H^{-1} K_{H,O})$. Therefore
parametric consistency is different from algebraic consistency, which requires that $(\hat{S}, \hat{L})$
have the same support/sign-pattern and rank as $(K_O, K_{O,H} K_H^{-1} K_{H,O})$.

Goal Let $K^*_{(O\,H)}$ denote the concentration matrix of a Gaussian model. Suppose that
we have $n$ samples $\{X_O^i\}_{i=1}^n$ of the observed variables $X_O$. We would like to produce
estimates $(\hat{S}_n, \hat{L}_n)$ that, with high probability, are both algebraically consistent and
consistent in the parametric sense (in some norm).

■  4.2.3  Likelihood  function  and  Fisher  information 

Given $n$ samples $\{X^i\}_{i=1}^n$ of a finite collection of jointly Gaussian zero-mean random
variables with concentration matrix $K^*$, we define the sample covariance as follows:

$$\Sigma^n \triangleq \frac{1}{n} \sum_{i=1}^n X^i (X^i)^T. \tag{4.4}$$

It is then easily seen that the log-likelihood function is given by:

$$\ell(K; \Sigma^n) = \log\det(K) - \mathrm{Tr}(K \Sigma^n), \tag{4.5}$$

where $\ell(K; \Sigma^n)$ is a function of $K$. Notice that this function is strictly concave for
$K \succ 0$. Now consider the latent-variable modeling problem in which we wish to model
a collection of random variables $X_O$ (with sample covariance $\Sigma_O^n$) by adding some extra
variables $X_H$. With respect to the parametrization $(S, L)$ (with $S$ representing the
conditional statistics of $X_O$ given $X_H$, and $L$ summarizing the effect of marginalization
over the additional variables $X_H$), the likelihood function is given by:

$$\bar{\ell}(S, L; \Sigma_O^n) = \ell(S - L; \Sigma_O^n).$$

The function $\bar{\ell}$ is jointly concave with respect to the parameters $(S, L)$ whenever
$S - L \succ 0$, and it is this function that we use in our variational formulation (4.1) to learn a
latent-variable model.
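
The sample covariance (4.4) and the log-likelihood (4.5) can be sketched in a few lines. The function names and the four deterministic "samples" below are hypothetical and serve only to illustrate that the log-likelihood, as a function of $K$, peaks at the inverse of the sample covariance.

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance (4.4); X has shape (n, p), one zero-mean sample per row."""
    n = X.shape[0]
    return X.T @ X / n

def log_likelihood(K, Sigma_n):
    """Gaussian log-likelihood (4.5): log det(K) - Tr(K Sigma_n), for K > 0."""
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        raise ValueError("K must be positive definite")
    return logdet - np.trace(K @ Sigma_n)

def latent_log_likelihood(S, L, Sigma_n):
    """Likelihood in the (S, L) parametrization: l(S - L; Sigma_n)."""
    return log_likelihood(S - L, Sigma_n)

# Four illustrative samples in two dimensions; their sample covariance is 0.5 * I.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
Sigma_n = sample_covariance(X)

at_optimum = log_likelihood(2.0 * np.eye(2), Sigma_n)   # K = inverse of Sigma_n
below = log_likelihood(1.0 * np.eye(2), Sigma_n)
above = log_likelihood(3.0 * np.eye(2), Sigma_n)
```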

In the analysis of a convex program involving the likelihood function, the Fisher
information plays an important role as it is the negative of the Hessian of the likelihood
function and thus controls the curvature. As the trace term in the likelihood function
is linear, we need only study higher-order derivatives of the log-determinant function
in order to compute the Hessian. Letting $\mathcal{I}$ denote the Fisher information matrix, we
have that [24]

$$\mathcal{I}(K^*) \triangleq -\nabla_K^2 \log\det(K) \big|_{K = K^*} = (K^*)^{-1} \otimes (K^*)^{-1},$$

for $K^* \succ 0$. If $K^*$ is a $p \times p$ concentration matrix, then the Fisher information matrix
$\mathcal{I}(K^*)$ has dimensions $p^2 \times p^2$. Next consider the latent-variable situation with the
variables indexed by $O$ being observed and the variables indexed by $H$ being hidden.
The concentration matrix $\tilde{K}_O^* = (\Sigma_O^*)^{-1}$ of the marginal distribution of the observed
variables $O$ is given by the Schur complement (4.3), and the corresponding Fisher
information matrix is given by

$$\mathcal{I}(\tilde{K}_O^*) = (\tilde{K}_O^*)^{-1} \otimes (\tilde{K}_O^*)^{-1} = \Sigma_O^* \otimes \Sigma_O^*.$$

Notice that this is precisely the $|O|^2 \times |O|^2$ submatrix of the full Fisher information
matrix $\mathcal{I}(K^*_{(O\,H)}) = \Sigma^*_{(O\,H)} \otimes \Sigma^*_{(O\,H)}$ with respect to all the parameters $K^*_{(O\,H)} =
(\Sigma^*_{(O\,H)})^{-1}$ (corresponding to the situation in which all the variables $X_{O \cup H}$ are observed).
The matrix $\mathcal{I}(K^*_{(O\,H)})$ has dimensions $|O \cup H|^2 \times |O \cup H|^2$, while $\mathcal{I}(\tilde{K}_O^*)$ is an
$|O|^2 \times |O|^2$ matrix. To summarize, we have for all $i, j, k, l \in O$ that:

$$\mathcal{I}(\tilde{K}_O^*)_{(i,j),(k,l)} = [\Sigma^*_{(O\,H)} \otimes \Sigma^*_{(O\,H)}]_{(i,j),(k,l)} = \mathcal{I}(K^*_{(O\,H)})_{(i,j),(k,l)}.$$

In Section 4.3.2 we impose various conditions on the Fisher information matrix $\mathcal{I}(\tilde{K}_O^*)$
under which our regularized maximum-likelihood formulation provides consistent
estimates with high probability.
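
The submatrix relation between the marginal and full Fisher information matrices can be checked with Kronecker products. The sketch below reuses a hypothetical four-variable model (three observed, one hidden); the flat index `i * p + j` reflects the ordering produced by `np.kron` and is an implementation detail of this illustration.

```python
import numpy as np

# Hypothetical joint concentration matrix: first 3 variables observed, last hidden.
K_full = np.array([[2.0, 0.5, 0.0, 0.5],
                   [0.5, 2.0, 0.0, 0.5],
                   [0.0, 0.0, 2.0, 0.5],
                   [0.5, 0.5, 0.5, 2.0]])
p, o = 4, 3

Sigma_full = np.linalg.inv(K_full)
Sigma_O = Sigma_full[:o, :o]

fisher_full = np.kron(Sigma_full, Sigma_full)   # |O u H|^2 x |O u H|^2
fisher_marginal = np.kron(Sigma_O, Sigma_O)     # |O|^2 x |O|^2

# In the np.kron ordering, the index pair (i, j) sits at flat position i * p + j.
observed_pairs = [i * p + j for i in range(o) for j in range(o)]
submatrix = fisher_full[np.ix_(observed_pairs, observed_pairs)]

is_submatrix = np.allclose(submatrix, fisher_marginal)
```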

■  4.2.4  Curvature  of  rank  variety 

Recall from Chapter 3 that $\mathcal{S}(k)$ denotes the algebraic variety of matrices with at most
$k$ nonzero entries, and that $\mathcal{L}(r)$ denotes the algebraic variety of matrices with rank at
most $r$. The sparse matrix variety $\mathcal{S}(k)$ has the property that it has zero curvature at
any smooth point. Consequently the tangent space at a smooth point $M$ is the same as
the tangent space at any point in a neighborhood of $M$. This property is implicitly used
in the analysis of $\ell_1$-regularized methods for recovering sparse models. The situation is
more complicated for the low-rank matrix variety, because the curvature at any smooth
point is nonzero. Therefore we need to study how the tangent space changes from one
point to a neighboring point by analyzing how this variety curves locally. Indeed the
amount of curvature at a point is directly related to the “angle” between the tangent
space at that point and the tangent space at a neighboring point. For any subspace
$T$ of matrices, let $\mathcal{P}_T$ denote the projection onto $T$. Given two subspaces $T_1, T_2$ of the
same dimension, we measure the “twisting” between these subspaces by considering the
following quantity:

$$\rho(T_1, T_2) \triangleq \|\mathcal{P}_{T_1} - \mathcal{P}_{T_2}\|_{2 \to 2} = \max_{\|N\|_2 \le 1} \|[\mathcal{P}_{T_1} - \mathcal{P}_{T_2}](N)\|_2. \tag{4.6}$$

In Appendix B.1 we briefly review relevant results from matrix perturbation theory;
the key tool used to derive these results is the resolvent of a matrix [87]. Based on these
tools we prove the following two results in Appendix B.2, which bound the twisting
between the tangent spaces at nearby points. The first result provides a bound on the
quantity $\rho$ between the tangent spaces at a point and at its neighbor.

Proposition 4.2.1. Let $M \in \mathbb{R}^{p \times p}$ be a rank-$r$ matrix with smallest non-zero singular
value equal to $\sigma$, and let $\Delta$ be a perturbation to $M$ such that $\|\Delta\|_2 \le \frac{\sigma}{8}$. Further, let
$M + \Delta$ be a rank-$r$ matrix. Then we have that

$$\rho(T(M + \Delta), T(M)) \le \frac{2}{\sigma} \|\Delta\|_2.$$

The  next  result  bounds  the  error  between  a  point  and  its  neighbor  in  the  normal 
direction. 


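Proposition 4.2.2. Let $M \in \mathbb{R}^{p \times p}$ be a rank-$r$ matrix with smallest non-zero singular
value equal to $\sigma$, and let $\Delta$ be a perturbation to $M$ such that $\|\Delta\|_2 \le \frac{\sigma}{8}$. Further, let
$M + \Delta$ be a rank-$r$ matrix. Then we have that

$$\|\mathcal{P}_{T(M)^\perp}(\Delta)\|_2 \le \frac{\|\Delta\|_2^2}{\sigma}.$$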


These  results  suggest  that  the  closer  the  smallest  singular  value  is  to  zero,  the  more 
curved  the  variety  is  locally.  Therefore  we  control  the  twisting  between  tangent  spaces 
at  nearby  points  by  bounding  the  smallest  singular  value  away  from  zero. 
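
The dependence of the local curvature on the smallest singular value can be probed numerically. The sketch below relies on a standard fact that is assumed here rather than taken from this section: for a rank-$r$ matrix $M$ with leading singular vectors $U$ and $V$, the component of $\Delta$ normal to the tangent space $T(M)$ is $(I - UU^T) \Delta (I - VV^T)$. The function names and the rank-one test matrices are hypothetical. The code measures the ratio of the normal component to $\|\Delta\|_2^2$ for a rank-preserving perturbation.

```python
import numpy as np

def normal_component(M, Delta, r):
    """Project Delta onto the orthogonal complement of the tangent space T(M)."""
    U, _, Vt = np.linalg.svd(M)
    U_r, V_r = U[:, :r], Vt[:r, :].T
    P_U_perp = np.eye(M.shape[0]) - U_r @ U_r.T
    P_V_perp = np.eye(M.shape[1]) - V_r @ V_r.T
    return P_U_perp @ Delta @ P_V_perp

def curvature_ratio(sigma, t=0.01):
    """Ratio ||P_normal(Delta)||_2 / ||Delta||_2^2 for a rank-one example."""
    M = np.diag([sigma, 0.0])            # rank one, singular value sigma
    v = np.array([1.0, t])
    M_perturbed = sigma * np.outer(v, v) # still rank one
    Delta = M_perturbed - M
    normal = np.linalg.norm(normal_component(M, Delta, 1), 2)
    return normal / np.linalg.norm(Delta, 2) ** 2

ratio_large_sigma = curvature_ratio(1.0)
ratio_small_sigma = curvature_ratio(0.1)
```

In this example the ratio scales like $1/\sigma$: shrinking the singular value by a factor of ten increases the normal component, relative to $\|\Delta\|_2^2$, by the same factor, which is the sense in which the variety is more curved near matrices with small singular values.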


■  4.3  Identifiability 

In  the  absence  of  additional  conditions,  the  latent-variable  model  selection  problem  is 
ill-posed.  In  this  section  we  discuss  a  set  of  conditions  on  latent-variable  models  that 
ensure  that  these  models  are  identifiable  given  marginal  statistics  for  a  subset  of  the 
variables.  Recall  that  the  identifiability  conditions  of  Chapter  3  are  directly  applicable 
here,  and  we  rephrase  these  in  the  context  of  latent-variable  graphical  models. 

Structure  between  latent  and  observed  variables 

Suppose that the low-rank matrix that summarizes the effect of the hidden components
is itself sparse. This leads to identifiability issues in the sparse-plus-low-rank
decomposition problem. Statistically the additional correlations induced due to marginalization
over the latent variables could be mistaken for the conditional graphical model
structure of the observed variables. In order to avoid such identifiability problems the effect
of the latent variables must be “diffuse” across the observed variables. To address this
point the quantity $\xi(T(M))$ was introduced in Chapter 3 (see also [37]) to measure the
incoherence of the row/column spaces of $M$ with respect to the standard basis.

Curvature and change in $\xi$: As noted previously an important technical point
is that the algebraic variety of low-rank matrices is locally curved at any smooth point.
Consequently the quantity $\xi$ changes as we move along the low-rank matrix variety
smoothly. The quantity $\rho(T_1, T_2)$ introduced in (4.6) also allows us to bound the
variation in $\xi$ as follows.

Lemma 4.3.1. Let $T_1, T_2$ be two matrix subspaces of the same dimension with the
property that $\rho(T_1, T_2) < 1$, where $\rho$ is defined in (4.6). Then we have that

$$\xi(T_2) \le \frac{1}{1 - \rho(T_1, T_2)} \left[ \xi(T_1) + \rho(T_1, T_2) \right].$$

This lemma is proved in Appendix B.2.

Structure  among  observed  variables 

An identifiability problem also arises if the conditional graphical model among the
observed variables contains a densely connected subgraph. These statistical relationships
might be mistaken as correlations induced by marginalization over latent variables.
Therefore we need to ensure that the conditional graphical model among the observed
variables is sparse. We impose the condition that this conditional graphical model must
have small “degree”, i.e., no observed variable is directly connected to too many other
observed variables conditioned on the hidden components. Notice that bounding the
degree is a more refined condition than simply bounding the total number of non-zeros
as the sparsity pattern also plays a role. As described in Chapter 3 (see also [37]), the
quantity $\mu(\Omega(M))$ provides an appropriate measure of the sparsity pattern of a matrix
for the purposes of unique identifiability.

■  4.3.1  Transversality  of  tangent  spaces 

From Chapter 3 we recall that the transversality of the tangent spaces at the sparse and low-rank components with respect to the respective algebraic varieties governs their identifiability. In order to quantify the level of transversality between the tangent spaces $\Omega$ and $T$ we study the minimum gain with respect to some norm of the addition operator restricted to the cartesian product $\mathcal{Y} = \Omega \times T$. More concretely let $\mathcal{A} : \mathbb{R}^{p \times p} \times \mathbb{R}^{p \times p} \to \mathbb{R}^{p \times p}$ represent the addition operator, i.e., the operator that adds two matrices. Then given any matrix norm $\|\cdot\|$ on $\mathbb{R}^{p \times p} \times \mathbb{R}^{p \times p}$, the minimum gain of $\mathcal{A}$ restricted to $\mathcal{Y}$ is defined as follows:

$$\epsilon(\Omega, T, \|\cdot\|) \triangleq \min_{(S,L) \in \Omega \times T,\ \|(S,L)\| = 1} \|\mathcal{P}_{\mathcal{Y}} \mathcal{A}^{\dagger} \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)\|,$$

where $\mathcal{P}_{\mathcal{Y}}$ denotes the projection onto the space $\mathcal{Y}$, and $\mathcal{A}^{\dagger}$ denotes the adjoint of the addition operator (with respect to the standard Euclidean inner-product). The tangent spaces $\Omega$ and $T$ have a transverse intersection if and only if $\epsilon(\Omega, T, \|\cdot\|) > 0$. The “level” of transversality is measured by the magnitude of $\epsilon(\Omega, T, \|\cdot\|)$. Note that if the norm $\|\cdot\|$ used is the Frobenius norm, then $\epsilon(\Omega, T, \|\cdot\|_F)$ is the square of the minimum singular value of the addition operator $\mathcal{A}$ restricted to $\Omega \times T$.
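The Frobenius-norm case can be checked numerically on small examples. The sketch below is illustrative only: the helper names (`support_basis`, `tangent_basis`, `min_gain_frobenius`) are hypothetical, and the tangent space at a rank-$r$ matrix with column space $U$ and row space $V$ is assumed to be given by the usual projection $\mathcal{P}_T(N) = P_U N + N P_V - P_U N P_V$, which is not restated in this section. It compares a diffuse rank-one matrix with a rank-one matrix that is itself sparse, with $\Omega$ taken to be the diagonal matrices.

```python
import numpy as np

def support_basis(mask):
    """Orthonormal basis (columns are vectorized matrices) of Omega:
    all matrices supported on the True entries of `mask`."""
    idx = np.flatnonzero(mask.ravel())
    B = np.zeros((mask.size, idx.size))
    B[idx, np.arange(idx.size)] = 1.0
    return B

def tangent_basis(L, rank):
    """Orthonormal basis of the tangent space T at the rank-`rank` matrix L,
    using P_T(N) = P_U N + N P_V - P_U N P_V (assumed parametrization)."""
    p, q = L.shape
    U, _, Vt = np.linalg.svd(L)
    PU = U[:, :rank] @ U[:, :rank].T
    PV = Vt[:rank].T @ Vt[:rank]
    PT = np.empty((p * q, p * q))
    for k in range(p * q):
        E = np.zeros(p * q)
        E[k] = 1.0
        E = E.reshape(p, q)
        PT[:, k] = (PU @ E + E @ PV - PU @ E @ PV).ravel()
    # PT is an orthogonal projection, so its eigenvalues are 0 or 1.
    w, Q = np.linalg.eigh((PT + PT.T) / 2)
    return Q[:, w > 0.5]

def min_gain_frobenius(mask, L, rank):
    """Square of the smallest singular value of (S, L) -> S + L on Omega x T."""
    B = np.hstack([support_basis(mask), tangent_basis(L, rank)])
    return np.linalg.svd(B, compute_uv=False).min() ** 2

p = 6
mask = np.eye(p, dtype=bool)              # Omega: diagonal matrices
diffuse = np.ones((p, p)) / p             # rank one, spread over all entries
spiky = np.zeros((p, p))
spiky[0, 0] = 1.0                         # rank one and also sparse
eps_diffuse = min_gain_frobenius(mask, diffuse, 1)
eps_spiky = min_gain_frobenius(mask, spiky, 1)
```

In the diffuse case the gain is strictly positive (the spaces are transverse), while in the sparse rank-one case the matrix lies in both $\Omega$ and $T$ and the gain is numerically zero.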

A natural norm with which to measure transversality is the dual norm of the regularization function in (4.1), as the subdifferential of the regularization function is specified in terms of its dual. The reasons for this will become clearer as we proceed through this chapter. Recall that the regularization function used in the variational formulation (4.1) is given by:

$$f_\gamma(S, L) = \gamma \|S\|_1 + \|L\|_*,$$

where the nuclear norm $\|\cdot\|_*$ reduces to the trace function over the cone of positive-semidefinite matrices. This function is a norm for all $\gamma > 0$. The dual norm of $f_\gamma$ is given by

$$g_\gamma(S, L) = \max\left\{ \frac{\|S\|_\infty}{\gamma},\ \|L\|_2 \right\}.$$
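As a small numerical illustration of this duality (a sketch with hypothetical function names, where $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are taken elementwise), one can evaluate both norms with NumPy and check the Hölder-type inequality $\langle S_1, S_2 \rangle + \langle L_1, L_2 \rangle \leq f_\gamma(S_1, L_1)\, g_\gamma(S_2, L_2)$ on random matrices.

```python
import numpy as np

def f_gamma(S, L, gamma):
    """Regularizer gamma * ||S||_1 + ||L||_* (elementwise l1 plus nuclear norm)."""
    return gamma * np.abs(S).sum() + np.linalg.norm(L, "nuc")

def g_gamma(S, L, gamma):
    """Dual norm max(||S||_inf / gamma, ||L||_2) (elementwise max, spectral norm)."""
    return max(np.abs(S).max() / gamma, np.linalg.norm(L, 2))

rng = np.random.default_rng(0)
gamma = 0.3
S1, L1, S2, L2 = (rng.standard_normal((5, 5)) for _ in range(4))
inner = np.sum(S1 * S2) + np.sum(L1 * L2)      # pairing on the product space
bound = f_gamma(S1, L1, gamma) * g_gamma(S2, L2, gamma)
```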

The following simple lemma records a useful property of the $g_\gamma$ norm that is used several times throughout this chapter.

Lemma 4.3.2. Let $\Omega$ and $T$ be tangent spaces at any points with respect to the algebraic varieties of sparse and low-rank matrices. Then for any matrix $M$, we have that $\|\mathcal{P}_{\Omega}(M)\|_\infty \leq \|M\|_\infty$ and that $\|\mathcal{P}_{T}(M)\|_2 \leq 2\|M\|_2$. Further we also have that $\|\mathcal{P}_{\Omega^\perp}(M)\|_\infty \leq \|M\|_\infty$ and that $\|\mathcal{P}_{T^\perp}(M)\|_2 \leq \|M\|_2$. Thus for any matrices $M, N$ and for $\mathcal{Y} = \Omega \times T$, one can check that $g_\gamma(\mathcal{P}_{\mathcal{Y}}(M, N)) \leq 2 g_\gamma(M, N)$ and that $g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp}(M, N)) \leq g_\gamma(M, N)$.

Next we define the quantity $\chi(\Omega, T, \gamma)$ as follows in order to study the transversality of the spaces $\Omega$ and $T$ with respect to the $g_\gamma$ norm:

$$\chi(\Omega, T, \gamma) \triangleq \max\left\{ \frac{\xi(T)}{\gamma},\ 2\mu(\Omega)\gamma \right\}. \qquad (4.7)$$

Here $\mu$ and $\xi$ are defined in Chapter 3. We then have the following result (proved in Appendix B.3):

Lemma 4.3.3. Let $S \in \Omega$, $L \in T$ be matrices such that $\|S\|_\infty = \gamma$ and let $\|L\|_2 = 1$. Then we have that $g_\gamma(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^{\dagger} \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)) \in [1 - \chi(\Omega, T, \gamma),\ 1 + \chi(\Omega, T, \gamma)]$, where $\mathcal{Y} = \Omega \times T$ and $\chi(\Omega, T, \gamma)$ is defined in (4.7). In particular we have that $1 - \chi(\Omega, T, \gamma) \leq \epsilon(\Omega, T, g_\gamma)$.

The quantity $\chi(\Omega, T, \gamma)$ being small implies that the addition operator is essentially isometric when restricted to $\mathcal{Y} = \Omega \times T$. Stated differently the magnitude of $\chi(\Omega, T, \gamma)$ is a measure of the level of transversality of the spaces $\Omega$ and $T$. If $\mu(\Omega)\xi(T) < \frac{1}{2}$ then $\gamma \in \left(\xi(T), \frac{1}{2\mu(\Omega)}\right)$ ensures that $\chi(\Omega, T, \gamma) < 1$, which in turn implies that the tangent spaces $\Omega$ and $T$ have a transverse intersection.

Observation: Thus we have that the smaller the quantities $\mu(\Omega)$ and $\xi(T)$, the more transverse the intersection of the spaces $\Omega$ and $T$.
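The role of the product $\mu(\Omega)\xi(T)$ can be made explicit with a short calculation (added here as an illustration; it uses only the definition (4.7)). The first term in the maximum decreases in $\gamma$ and the second increases, so the best choice of $\gamma$ balances the two:

$$\frac{\xi(T)}{\gamma} = 2\mu(\Omega)\gamma \iff \gamma = \sqrt{\frac{\xi(T)}{2\mu(\Omega)}}, \qquad \min_{\gamma > 0} \chi(\Omega, T, \gamma) = \sqrt{2\mu(\Omega)\xi(T)}.$$

Hence some $\gamma$ with $\chi(\Omega, T, \gamma) < 1$ exists exactly when $\mu(\Omega)\xi(T) < \frac{1}{2}$, and the set of such $\gamma$ is the open interval $\left(\xi(T), \frac{1}{2\mu(\Omega)}\right)$.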

■  4.3.2  Conditions  on  Fisher  information 

The main focus of Section 4.4 is to analyze the regularized maximum-likelihood convex program (4.1) by studying its optimality conditions. The log-likelihood function is well-approximated in a neighborhood by a quadratic form given by the Fisher information (which measures the curvature, as discussed in Section 4.2.3). Let $\mathcal{I}^* = \mathcal{I}(\tilde{K}_O^*)$ denote the Fisher information evaluated at the true marginal concentration matrix $\tilde{K}_O^* = K_O^* - K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$, where $K_{(O\,H)}^*$ represents the concentration matrix of the full model (see equation (4.3)). The appropriate measure of transversality between the tangent spaces¹ $\Omega = \Omega(K_O^*)$ and $T = T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$ is then in a space in which the inner-product is given by $\mathcal{I}^*$. Specifically, we need to analyze the minimum gain of the operator $\mathcal{P}_{\mathcal{Y}} \mathcal{A}^{\dagger} \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}$ restricted to the space $\mathcal{Y} = \Omega \times T$. Therefore we impose several conditions on the Fisher information $\mathcal{I}^*$. We define quantities that control the gains of $\mathcal{I}^*$ restricted to $\Omega$ and $T$ separately; these ensure that elements of $\Omega$ and elements of $T$ are individually identifiable under the map $\mathcal{I}^*$. In addition we define quantities that, in conjunction with bounds on $\mu(\Omega)$ and $\xi(T)$, allow us to control the gain of $\mathcal{I}^*$ restricted to the direct sum $\Omega \oplus T$.

¹ We implicitly assume that these tangent spaces are subspaces of the space of symmetric matrices.

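$\mathcal{I}^*$ restricted to $\Omega$: The minimum gain of the operator $\mathcal{P}_{\Omega} \mathcal{I}^* \mathcal{P}_{\Omega}$ restricted to $\Omega$ is given by

$$\alpha_\Omega \triangleq \min_{M \in \Omega,\ \|M\|_\infty = 1} \|\mathcal{P}_{\Omega} \mathcal{I}^* \mathcal{P}_{\Omega}(M)\|_\infty.$$

The maximum effect of elements in $\Omega$ in the orthogonal direction $\Omega^\perp$ is given by

$$\delta_\Omega \triangleq \max_{M \in \Omega,\ \|M\|_\infty = 1} \|\mathcal{P}_{\Omega^\perp} \mathcal{I}^* \mathcal{P}_{\Omega}(M)\|_\infty.$$

The operator $\mathcal{I}^*$ is injective on $\Omega$ if $\alpha_\Omega > 0$. The ratio $\frac{\delta_\Omega}{\alpha_\Omega} \leq 1 - \nu$ implies the irrepresentability condition imposed in [119], which gives a sufficient condition for consistent recovery of graphical model structure using $\ell_1$-regularized maximum-likelihood. Notice that this condition is a generalization of the usual Lasso irrepresentability conditions [148], which are typically imposed on the covariance matrix. Finally we also consider the following quantity, which controls the behavior of $\mathcal{I}^*$ restricted to $\Omega$ in the spectral norm:

$$\beta_\Omega \triangleq \max_{M \in \Omega,\ \|M\|_2 = 1} \|\mathcal{I}^*(M)\|_2.$$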

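$\mathcal{I}^*$ restricted to $T$: Analogous to the case of $\Omega$ one could control the gains of the operators $\mathcal{P}_{T^\perp} \mathcal{I}^* \mathcal{P}_{T}$ and $\mathcal{P}_{T} \mathcal{I}^* \mathcal{P}_{T}$. However as discussed previously one complication is that the tangent spaces at nearby smooth points on the rank variety are in general different, and the amount of twisting between these spaces is governed by the local curvature. Therefore we control the gains of the operators $\mathcal{P}_{{T'}^\perp} \mathcal{I}^* \mathcal{P}_{T'}$ and $\mathcal{P}_{T'} \mathcal{I}^* \mathcal{P}_{T'}$ for all tangent spaces $T'$ that are “close to” the nominal $T$ (at the true underlying low-rank matrix), measured by $\rho(T, T')$ (4.6) being small. The minimum gain of the operator $\mathcal{P}_{T'} \mathcal{I}^* \mathcal{P}_{T'}$ restricted to $T'$ (close to $T$) is given by

$$\alpha_T \triangleq \min_{\rho(T', T) \leq \frac{\xi(T)}{2}}\ \min_{M \in T',\ \|M\|_2 = 1} \|\mathcal{P}_{T'} \mathcal{I}^* \mathcal{P}_{T'}(M)\|_2.$$

Similarly the maximum effect of elements in $T'$ in the orthogonal direction ${T'}^\perp$ (for $T'$ close to $T$) is given by

$$\delta_T \triangleq \max_{\rho(T', T) \leq \frac{\xi(T)}{2}}\ \max_{M \in T',\ \|M\|_2 = 1} \|\mathcal{P}_{{T'}^\perp} \mathcal{I}^* \mathcal{P}_{T'}(M)\|_2.$$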
Implicit in the definition of $\alpha_T$ and $\delta_T$ is the fact that the outer minimum and maximum are only taken over spaces $T'$ that are tangent spaces to the rank-variety. The operator $\mathcal{I}^*$ is injective on all tangent spaces $T'$ such that $\rho(T', T) \leq \frac{\xi(T)}{2}$ if $\alpha_T > 0$. An irrepresentability condition (analogous to those developed for the sparse case) for tangent spaces near $T$ to the rank variety would be that $\frac{\delta_T}{\alpha_T} \leq 1 - \nu$. Finally we also control the behavior of $\mathcal{I}^*$ restricted to $T'$ close to $T$ in the $\ell_\infty$ norm:

$$\beta_T \triangleq \max_{\rho(T', T) \leq \frac{\xi(T)}{2}}\ \max_{M \in T',\ \|M\|_\infty = 1} \|\mathcal{I}^*(M)\|_\infty.$$

The two sets of quantities $(\alpha_\Omega, \delta_\Omega)$ and $(\alpha_T, \delta_T)$ essentially control how $\mathcal{I}^*$ behaves when restricted to the spaces $\Omega$ and $T$ separately (in the natural norms). The quantities $\beta_\Omega$ and $\beta_T$ are useful in order to control the gains of the operator $\mathcal{I}^*$ restricted to the direct sum $\Omega \oplus T$. Notice that although the magnitudes of elements in $\Omega$ are measured most naturally in the $\ell_\infty$ norm, the quantity $\beta_\Omega$ is specified with respect to the spectral norm. Similarly elements of the tangent spaces $T'$ to the rank variety are most naturally measured in the spectral norm, but $\beta_T$ provides control in the $\ell_\infty$ norm. These quantities, combined with $\mu(\Omega)$ and $\xi(T)$, provide the “coupling” necessary to control the behavior of $\mathcal{I}^*$ restricted to elements in the direct sum $\Omega \oplus T$. In order to keep track of fewer quantities, we summarize the six quantities as follows:

$$\alpha \triangleq \min(\alpha_\Omega, \alpha_T), \qquad \delta \triangleq \max(\delta_\Omega, \delta_T), \qquad \beta \triangleq \max(\beta_\Omega, \beta_T).$$

Main assumption There exists a $\nu \in (0, \frac{1}{2}]$ such that:

$$\frac{\delta}{\alpha} \leq 1 - 2\nu.$$

This assumption is to be viewed as a generalization of the irrepresentability conditions imposed on the covariance matrix [148] or the Fisher information matrix [119] in order to provide consistency guarantees for sparse model selection using the $\ell_1$ norm. With this assumption we have the following proposition, proved in Appendix B.3, about the gains of the operator $\mathcal{I}^*$ restricted to $\Omega \oplus T$. This proposition plays a fundamental role in the analysis of the performance of the regularized maximum-likelihood procedure (4.1).
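Proposition 4.3.1. Let $\Omega$ and $T$ be the tangent spaces defined in this section, and let $\mathcal{I}^*$ be the Fisher information evaluated at the true marginal concentration matrix. Further let $\alpha, \delta, \beta$ be as defined above. Suppose that

and that $\gamma$ is in the following range:

$$\gamma \in \left[ \frac{3\beta(2 - \nu)\xi(T)}{\nu\alpha},\ \frac{\nu\alpha}{2\beta(2 - \nu)\mu(\Omega)} \right].$$

Then we have the following two conclusions for $\mathcal{Y} = \Omega \times T'$ with $\rho(T', T) \leq \frac{\xi(T)}{2}$:

1. The minimum gain of $\mathcal{I}^*$ restricted to $\Omega \oplus T'$ is bounded below:

$$\min_{(S,L) \in \mathcal{Y},\ \|S\|_\infty = \gamma,\ \|L\|_2 = 1} g_\gamma(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^{\dagger} \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)) \geq \frac{\alpha}{2}.$$

Specifically this implies that for all $(S, L) \in \mathcal{Y}$

$$g_\gamma(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^{\dagger} \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)) \geq \frac{\alpha}{2}\, g_\gamma(S, L).$$

2. The effect of elements in $\mathcal{Y} = \Omega \times T'$ on the orthogonal complement $\mathcal{Y}^\perp = \Omega^\perp \times {T'}^\perp$ is bounded above:

$$\left\| \mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^{\dagger} \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}} \left( \mathcal{P}_{\mathcal{Y}} \mathcal{A}^{\dagger} \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}} \right)^{-1} \right\|_{g_\gamma \to g_\gamma} \leq 1 - \nu.$$

Specifically this implies that for all $(S, L) \in \mathcal{Y}$

$$g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^{\dagger} \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)) \leq (1 - \nu)\, g_\gamma(\mathcal{P}_{\mathcal{Y}} \mathcal{A}^{\dagger} \mathcal{I}^* \mathcal{A} \mathcal{P}_{\mathcal{Y}}(S, L)).$$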
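As a consistency check on the range for $\gamma$ in Proposition 4.3.1 (a short calculation added for illustration, using only the two endpoints of that range), the interval is non-empty only if its lower endpoint does not exceed its upper endpoint. Cross-multiplying gives a bound on the product $\mu(\Omega)\xi(T)$ in terms of $\alpha$, $\beta$ and $\nu$:

$$\frac{3\beta(2 - \nu)\xi(T)}{\nu\alpha} \leq \frac{\nu\alpha}{2\beta(2 - \nu)\mu(\Omega)} \iff \mu(\Omega)\xi(T) \leq \frac{1}{6}\left( \frac{\nu\alpha}{\beta(2 - \nu)} \right)^2.$$

This agrees with the later remark that, when $\alpha, \beta, \nu$ do not scale with the problem size, the product of $\mu$ and $\xi$ must be bounded by a constant.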
■  4.4  Regularized  Maximum-Likelihood  Convex  Program  and  Consistency

■  4.4.1  Setup

Let $K_{(O\,H)}^*$ denote the full concentration matrix of a collection of zero-mean jointly-Gaussian observed and latent variables, let $p = |O|$ denote the number of observed variables, and let $h = |H|$ denote the number of latent variables. We are given $n$ samples $\{X_O^i\}_{i=1}^n$ of the observed variables $X_O$. We consider the high-dimensional setting in which $(p, h, n)$ are all allowed to grow simultaneously. The quantities $\alpha, \delta, \beta, \nu, \psi$ defined in the previous section are accounted for in our analysis, although we suppress the dependence on these quantities in the statement of our main result. We explicitly keep track of the quantities $\mu(\Omega(K_O^*))$ and $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ as these control the complexity of the latent-variable model given by $K_{(O\,H)}^*$. In particular $\mu$ controls the sparsity of the conditional graphical model among the observed variables, while $\xi$ controls the incoherence or “diffusivity” of the extra correlations induced due to marginalization over the hidden variables. Based on the tradeoff between these two quantities, we obtain a number of classes of latent-variable graphical models (and corresponding scalings of $(p, h, n)$) that can be consistently recovered using the regularized maximum-likelihood convex program (4.1) (see Section 4.4.3 for details). Specifically we show that consistent model selection is possible even when the number of samples and the number of latent variables are on the same order as the number of observed variables. We present our main result next demonstrating the consistency of the estimator (4.1), and then discuss classes of latent-variable graphical models and various scaling regimes in which our estimator is consistent.

■  4.4.2  Main  results 

Given $n$ samples $\{X_O^i\}_{i=1}^n$ of the observed variables $X_O$, the sample covariance is defined as:

$$\Sigma_O^n \triangleq \frac{1}{n} \sum_{i=1}^{n} X_O^i (X_O^i)^T.$$

As discussed in Section 4.2.2 the goal is to produce an estimate given by a pair of matrices $(S, L)$ of the latent-variable model represented by $K_{(O\,H)}^*$. We study the consistency properties of the following regularized maximum-likelihood convex program:

$$(\hat{S}_n, \hat{L}_n) = \arg\min_{S, L}\ \mathrm{Tr}[(S - L)\, \Sigma_O^n] - \log\det(S - L) + \lambda_n \left[ \gamma \|S\|_1 + \mathrm{Tr}(L) \right] \quad \text{s.t.} \quad S - L \succ 0,\ L \succeq 0. \qquad (4.9)$$
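To make the pieces of (4.9) concrete, the following NumPy sketch evaluates the sample covariance and the objective at a candidate pair $(S, L)$. It is an evaluator only, not a solver, and the function names are hypothetical; infeasible points are assigned the value $+\infty$, and $S$ and $L$ are assumed symmetric.

```python
import numpy as np

def sample_covariance(X):
    """X has shape (n, p): one zero-mean sample per row. Returns (1/n) sum x x^T."""
    return X.T @ X / X.shape[0]

def objective(S, L, Sigma_n, lam, gamma):
    """Objective of (4.9) at symmetric (S, L); +inf if S - L is not positive
    definite or L is not positive semidefinite (up to a small tolerance)."""
    K = S - L
    if np.linalg.eigvalsh(K).min() <= 0 or np.linalg.eigvalsh(L).min() < -1e-10:
        return np.inf
    _, logdet = np.linalg.slogdet(K)
    penalty = gamma * np.abs(S).sum() + np.trace(L)
    return np.trace(K @ Sigma_n) - logdet + lam * penalty
```

For example, with `lam = 0`, `L = 0` and `S` equal to the inverse sample covariance, the objective equals $p + \log\det(\Sigma_O^n)$, the value at the unregularized maximum-likelihood estimate.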
Here $\lambda_n$ is a regularization parameter, and $\gamma$ is a tradeoff parameter between the rank and sparsity terms. Notice from Proposition 4.3.1 that the choice of $\gamma$ depends on the values of $\mu(\Omega(K_O^*))$ and $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$; essentially these quantities correspond to the degree of the conditional graphical model structure of the observed variables and the incoherence of the low-rank matrix summarizing the effect of the latent variables (see Section 4.3). While these quantities may not be known a priori, we discuss a method to choose $\gamma$ numerically in our experimental results (see Section 4.5). The following theorem shows that the estimates $(\hat{S}_n, \hat{L}_n)$ provided by the convex program (4.9) are consistent for a suitable choice of $\lambda_n$. In addition to the appropriate identifiability conditions (as specified by Proposition 4.3.1), we also impose lower bounds on the minimum nonzero entry of the sparse conditional graphical model matrix $K_O^*$ and on the minimum nonzero singular value of the low-rank matrix $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ summarizing the effect of the hidden variables. We suppress the dependence on $\alpha, \beta, \delta, \nu, \psi$, and emphasize the dependence on $\mu(\Omega(K_O^*))$ and $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ because these control the complexity of the underlying latent-variable graphical model as discussed above.

Theorem 4.4.1. Let $K_{(O\,H)}^*$ denote the concentration matrix of a Gaussian model. We have $n$ samples $\{X_O^i\}_{i=1}^n$ of the $p$ observed variables denoted by $O$. Let $\Omega = \Omega(K_O^*)$ and $T = T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$ denote the tangent spaces at $K_O^*$ and at $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ with respect to the sparse and low-rank matrix varieties respectively.

Assumptions: Suppose that the following conditions hold:

1. The quantities $\mu(\Omega)$ and $\xi(T)$ satisfy the assumption of Proposition 4.3.1 for identifiability, and $\gamma$ is chosen in the range specified by Proposition 4.3.1.


$$\ldots\ \frac{1}{\xi(T)^{3}} \sqrt{\frac{p}{n}}.$$

5. The minimum magnitude nonzero entry $\theta$ of $K_O^*$ is bounded as


Conclusions: Then with probability greater than $1 - 2\exp\{-p\}$ we have:

1. Algebraic consistency: The estimate $(\hat{S}_n, \hat{L}_n)$ given by the convex program (4.9) is algebraically consistent, i.e., the support and sign pattern of $\hat{S}_n$ is the same as that of $K_O^*$, and the rank of $\hat{L}_n$ is the same as that of $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$.

2. Parametric consistency: The estimate $(\hat{S}_n, \hat{L}_n)$ given by the convex program (4.9) is parametrically consistent:

$$g_\gamma\left(\hat{S}_n - K_O^*,\ \hat{L}_n - K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*\right) \lesssim \ldots$$

The proof of this theorem is given in Appendix B.4. The theorem essentially states that if the minimum nonzero singular value of the low-rank piece $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ and minimum nonzero entry of the sparse piece $K_O^*$ are bounded away from zero, then the convex program (4.9) provides estimates that are both algebraically consistent and parametrically consistent (in the $\ell_\infty$ and spectral norms). In Section 4.4.4 we also show that these results easily lead to parametric consistency rates for the corresponding estimate $(\hat{S}_n - \hat{L}_n)^{-1}$ of the marginal covariance $\Sigma_O^*$ of the observed variables.

Remarks Notice that the condition on the minimum singular value of $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ is more stringent than on the minimum nonzero entry of $K_O^*$. One role played by these conditions is to ensure that the estimates $(\hat{S}_n, \hat{L}_n)$ do not have smaller support size/rank than $(K_O^*, K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$. However the minimum singular value bound plays the additional role of bounding the curvature of the low-rank matrix variety around the point $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$, which is the reason for this condition being more stringent. Notice also that the number of hidden variables $h$ does not explicitly appear in the sample complexity bound in Theorem 4.4.1, which only depends on $p$, $\mu(\Omega(K_O^*))$, $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$. However the dependence on $h$ is implicit in the dependence on $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$, and we discuss this point in greater detail in the following section.

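Finally we remark that algebraic and parametric consistency hold under the assumptions of Theorem 4.4.1 for a range of values of $\gamma$, namely the range specified by Proposition 4.3.1:

$$\gamma \in \left[ \frac{3\beta(2 - \nu)\xi(T)}{\nu\alpha},\ \frac{\nu\alpha}{2\beta(2 - \nu)\mu(\Omega)} \right].$$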



In particular the assumptions on the sample complexity, the minimum nonzero singular value of $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$, and the minimum magnitude nonzero entry of $K_O^*$ are governed by the lower end of this range for $\gamma$. These assumptions can be weakened if we only require consistency for a smaller range of values of $\gamma$. The following corollary conveys this point with a specific example:

Corollary 4.4.1. Consider the same setup and notation as in Theorem 4.4.1. Suppose that the quantities $\mu(\Omega)$ and $\xi(T)$ satisfy the assumption of Proposition 4.3.1 for identifiability. Suppose that we make the following assumptions:

1. Let $\gamma$ be chosen to be equal to $\frac{\nu\alpha}{2\beta(2 - \nu)\mu(\Omega)}$ (the upper end of the range specified in Proposition 4.3.1), i.e., $\gamma \asymp \frac{1}{\mu(\Omega)}$.

2. $n > p$.

Then with probability greater than $1 - 2\exp\{-p\}$ we have estimates $(\hat{S}_n, \hat{L}_n)$ that are algebraically consistent, and parametrically consistent with the error bounded as

$$g_\gamma\left(\hat{S}_n - K_O^*,\ \hat{L}_n - K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*\right) \lesssim \ldots$$

The proof of this corollary is analogous to that of Theorem 4.4.1. We emphasize that in practice it is often beneficial to have consistent estimates for a range of values of $\gamma$ (as in Theorem 4.4.1). Specifically the stability of the sparsity pattern and rank of the estimates $(\hat{S}_n, \hat{L}_n)$ for a range of tradeoff parameters is useful in order to choose a suitable value of $\gamma$, as prior information about the quantities $\mu(\Omega(K_O^*))$ and $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ is not typically available (see Section 4.5).


■  4.4.3  Scaling  regimes 

Next we consider classes of latent-variable models that satisfy the conditions of Theorem 4.4.1. Recall that $n$ denotes the number of samples, $p$ denotes the number of observed variables, and $h$ denotes the number of latent variables. We assume that the parameters $\alpha, \beta, \delta, \nu, \psi$ defined in Section 4.3.2 remain constant, and do not scale with the other parameters such as $(p, h, n)$ or $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ or $\mu(\Omega(K_O^*))$. In particular we focus on the tradeoff between $\xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$ and $\mu(\Omega(K_O^*))$ (the quantities that control the complexity of a latent-variable graphical model), and the resulting scaling regimes for consistent estimation. Let $d = \deg(K_O^*)$ denote the degree of the conditional graphical model among the observed variables, and let $i = \mathrm{inc}(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$ denote the incoherence of the correlations induced due to marginalization over the latent variables (we suppress the dependence on $n$). These quantities are defined in Chapter 3, and we have from the propositions therein that

$$\mu(\Omega(K_O^*)) \leq d, \qquad \xi(T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)) \leq 2i.$$

Since $\alpha, \beta, \delta, \nu, \psi$ do not scale with the other parameters, we also have from Proposition 4.3.1 that the product of $\mu$ and $\xi$ must be bounded by a constant. Thus, we study latent-variable models in which

$$d\, i = O(1).$$

As we describe next, there are non-trivial classes of latent-variable graphical models in which this condition holds.

Bounded degree and incoherence: The first class of latent-variable models that we consider are those in which the conditional graphical model among the observed variables (given by $K_O^*$) has constant degree $d$. Recall from Chapter 3 that the incoherence $i$ of the effect of the latent variables (given by $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$) can be as small as $\sqrt{h/p}$. Consequently latent-variable models in which

$$d = O(1), \qquad h = O(p),$$

can be estimated consistently from $n = O(p)$ samples as long as the low-rank matrix $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ is almost maximally incoherent, i.e., $i = O(\sqrt{h/p})$, so the effect of marginalization over the latent variables is diffuse across almost all the observed variables. Thus consistent model selection is possible even when the number of samples and the number of latent variables are on the same order as the number of observed variables.
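The lower limit $\sqrt{h/p}$ on the incoherence can be checked numerically. The sketch below assumes the definition $\mathrm{inc}(M) = \max_i \|P_U e_i\|_2$, where $U$ is the row space of $M$ and $e_i$ is the $i$-th standard basis vector; this definition is taken from the related rank-sparsity literature rather than from this section, and the function name is hypothetical. It compares a randomly generated (diffuse) rank-$h$ matrix with a rank-$h$ matrix concentrated on $h$ variables.

```python
import numpy as np

def incoherence(M, rank):
    """max_i ||P_U e_i||_2, where U is the row space of the rank-`rank` matrix M."""
    _, _, Vt = np.linalg.svd(M)
    V = Vt[:rank].T                       # orthonormal basis of the row space
    # ||P_U e_i||_2 equals the Euclidean norm of the i-th row of V.
    return np.sqrt((V ** 2).sum(axis=1).max())

p, h = 36, 2
rng = np.random.default_rng(0)
B = rng.standard_normal((p, h))
diffuse = B @ B.T                         # latent effect spread over all variables
spiky = np.zeros((p, p))
spiky[:h, :h] = np.eye(h)                 # latent effect on only h variables
lower = np.sqrt(h / p)                    # smallest possible incoherence
inc_diffuse = incoherence(diffuse, h)
inc_spiky = incoherence(spiky, h)
```

Since the squared row norms of $V$ sum to $h$, the largest one is at least $h/p$, which gives the lower limit; the concentrated example attains the upper limit of 1.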

Polylogarithmic degree models The next class of models that we study are those in which the degree $d$ of the conditional graphical model of the observed variables grows poly-logarithmically with $p$. Consequently, the incoherence $i$ of the matrix $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ must decay as the inverse of poly-log($p$). Using the fact that maximally incoherent low-rank matrices $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ can have incoherence as small as $\sqrt{h/p}$, latent-variable models in which

$$d = O(\log(p)^q), \qquad h = O\left( \frac{p}{\log(p)^{2q}} \right),$$

can be consistently estimated as long as $n = O(p\ \text{poly-log}(p))$.

■  4.4.4  Rates  for  covariance  matrix  estimation 

The main result Theorem 4.4.1 gives the number of samples required for consistent estimation of the sparse and low-rank parts that compose the marginal concentration matrix $\tilde{K}_O^*$. Here we prove a corollary that gives rates for covariance matrix estimation, i.e., the quality of the estimate $(\hat{S}_n - \hat{L}_n)^{-1}$ with respect to the “true” marginal covariance matrix $\Sigma_O^*$.

Corollary 4.4.2. Under the same conditions as in Theorem 4.4.1, we have with probability greater than $1 - 2\exp\{-p\}$ that

$$g_\gamma\left(\mathcal{A}^{\dagger}\left[(\hat{S}_n - \hat{L}_n)^{-1} - \Sigma_O^*\right]\right) \lesssim \lambda_n.$$

Specifically this implies that $\|(\hat{S}_n - \hat{L}_n)^{-1} - \Sigma_O^*\|_2 \lesssim \ldots$

Proof: The proof of this corollary follows directly from duality. Based on the analysis in Appendix B.4 (in particular using the optimality conditions of the modified convex program (B.14)), we have that

$$g_\gamma\left(\mathcal{A}^{\dagger}\left[(\hat{S}_n - \hat{L}_n)^{-1} - \Sigma_O^n\right]\right) \leq \lambda_n.$$

We also have from the bound on the number of samples $n$ that (see Appendix B.4.7)

$$g_\gamma\left(\mathcal{A}^{\dagger}\left[\Sigma_O^* - \Sigma_O^n\right]\right) \leq \lambda_n.$$

Based on the choice of $\lambda_n$ in Theorem 4.4.1, we then have the desired bound. □

■  4.4.5  Proof  strategy  for  Theorem  4.4.1 

Standard results from convex analysis [124] state that $(\hat{S}_n, \hat{L}_n)$ is a minimum of the convex program (4.9) if the zero matrix belongs to the subdifferential of the objective function evaluated at $(\hat{S}_n, \hat{L}_n)$ (in addition to $(\hat{S}_n, \hat{L}_n)$ satisfying the constraints). The subdifferential of the $\ell_1$ norm at a matrix $M$ is given by

$$N \in \partial\|M\|_1 \iff \mathcal{P}_{\Omega(M)}(N) = \mathrm{sign}(M),\quad \|\mathcal{P}_{\Omega(M)^\perp}(N)\|_\infty \leq 1.$$

For a symmetric positive semidefinite matrix $M$ with SVD $M = U D U^T$, the subdifferential of the trace function restricted to the cone of positive semidefinite matrices (i.e., the nuclear norm over this set) is given by:

$$N \in \partial\left[\mathrm{Tr}(M) + \mathbb{I}_{M \succeq 0}\right] \iff \mathcal{P}_{T(M)}(N) = U U^T,\quad \mathcal{P}_{T(M)^\perp}(N) \preceq I,$$

where $\mathbb{I}_{M \succeq 0}$ denotes the characteristic function of the set of positive semidefinite matrices (i.e., the convex function that evaluates to 0 over this set and $\infty$ outside). The key point is that elements of the subdifferential decompose with respect to the tangent spaces $\Omega(M)$ and $T(M)$. This decomposition property plays a critical role in our analysis. In particular it states that the optimality conditions consist of two parts, one part corresponding to the tangent spaces $\Omega$ and $T$ and another corresponding to the normal spaces $\Omega^\perp$ and $T^\perp$.
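The decomposition of the $\ell_1$ subdifferential into a part on the support and a part off the support can be illustrated with a short check (a sketch with a hypothetical helper name; the matrices are arbitrary small examples). The tangent-space condition fixes $N$ on the support of $M$, and the normal-space condition bounds $N$ elsewhere.

```python
import numpy as np

def is_l1_subgradient(N, M, tol=1e-12):
    """Check P_Omega(N) = sign(M) on the support of M and |N| <= 1 off it."""
    on = M != 0
    on_support_ok = np.allclose(N[on], np.sign(M[on]), atol=tol)
    off_support_ok = np.all(np.abs(N[~on]) <= 1 + tol)
    return bool(on_support_ok and off_support_ok)

M = np.array([[1.5, 0.0],
              [0.0, -2.0]])
N_good = np.array([[1.0, 0.3],
                   [-0.7, -1.0]])   # matches sign(M) on support, small elsewhere
N_bad = np.array([[1.0, 1.4],
                  [-0.7, -1.0]])    # violates the bound off the support
```

Any `N` passing this check satisfies the subgradient inequality $\|M'\|_1 \geq \|M\|_1 + \langle N, M' - M \rangle$ for every matrix $M'$.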

Consider the optimization problem (4.9) with the additional (non-convex) constraints that the variable $S$ belongs to the algebraic variety of sparse matrices and that the variable $L$ belongs to the algebraic variety of low-rank matrices. While this new optimization problem is non-convex, it has a very interesting property. At a globally optimal solution (and indeed at any locally optimal solution) $(S, L)$ such that $S$ and $L$ are smooth points of the algebraic varieties of sparse and low-rank matrices, the first-order optimality conditions state that the Lagrange multipliers corresponding to the additional variety constraints must lie in the normal spaces $\Omega(S)^\perp$ and $T(L)^\perp$. This fundamental observation, combined with the decomposition property of the subdifferentials of the $\ell_1$ and nuclear norms, suggests the following high-level proof strategy.

1. Let $(S, L)$ be the globally optimal solution of the optimization problem (4.9) with the additional constraints that $(S, L)$ belong to the algebraic varieties of sparse/low-rank matrices; specifically constrain $S$ to lie in $\mathcal{S}(|\mathrm{support}(K_O^*)|)$ and constrain $L$ to lie in $\mathcal{L}(\mathrm{rank}(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*))$. Show first that $(S, L)$ are smooth points of these varieties.

2. The first part of the subgradient optimality conditions of the original convex program (4.9) corresponding to components on the tangent spaces $\Omega(S)$ and $T(L)$ is satisfied. This conclusion can be reached because the additional Lagrange multipliers due to the variety constraints lie in the normal spaces $\Omega(S)^\perp$ and $T(L)^\perp$.

3. Finally show that the second part of the subgradient optimality conditions of (4.9) corresponding to components in the normal spaces $\Omega(S)^\perp$ and $T(L)^\perp$ is also satisfied.

Combining these steps together we show that $(S, L)$ satisfy the optimality conditions of the original convex program (4.9). Consequently $(S, L)$ is also the optimum of the convex program (4.9). As this estimate is also the solution to the problem with the variety constraints, the algebraic consistency of $(S, L)$ can be directly concluded. We emphasize here that the variety-constrained optimization problem is used solely as an analysis tool in order to prove consistency of the estimates provided by the convex program (4.9). These steps describe our broad strategy, and we refer the reader to Appendix B.4 for details. The key technical complication is that the tangent spaces at $L$ and $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ are in general different. We bound the twisting between these tangent spaces by using the fact that the minimum non-zero singular value of $K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ is bounded away from zero (as assumed in Theorem 4.4.1 and using Proposition 4.2.1).

■  4.5  Simulation  Results 

In this section we give experimental demonstration of the consistency of our estimator (4.9) on synthetic examples, and its effectiveness in modeling real-world stock return data. Our choices of $\lambda_n$ and $\gamma$ are guided by Theorem 4.4.1. Specifically, we choose $\lambda_n$ to be proportional to […]. For $\gamma$ we observe that the support/sign-pattern and the rank of the solution $(\hat{S}_n, \hat{L}_n)$ are the same for a range of values of $\gamma$. Therefore one could solve the convex program (4.9) for several values of $\gamma$, and choose a solution in a suitable range in which the sign-pattern and rank of the solution are stable. In practical problems with real-world data these parameters may be chosen via cross-validation. For small problem instances we solve the convex program (4.9) using a combination of YALMIP [98] and SDPT3 [136], which are standard off-the-shelf packages for solving convex programs. For larger problem instances we use the special purpose solver LogdetPPA [141] developed for log-determinant semidefinite programs.
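The stability heuristic for choosing $\gamma$ can be sketched in a few lines. The helper below is hypothetical and is not the procedure used for the reported experiments: it assumes the convex program has already been solved on a grid of $\gamma$ values, with each solution summarized by a hashable sign pattern and a rank, and it returns the longest run of consecutive grid points on which both stay the same.

```python
def stable_gamma_range(results):
    """`results` is a list of (gamma, sign_pattern, rank) tuples sorted by gamma,
    where sign_pattern is any hashable summary of sign(S_hat).
    Returns (gamma_low, gamma_high, count) for the longest run of consecutive
    gammas sharing the same (sign_pattern, rank), or None for an empty list."""
    best = None
    start = 0
    for k in range(1, len(results) + 1):
        # A run ends at the end of the list or when the summary changes.
        if k == len(results) or results[k][1:] != results[start][1:]:
            run = (results[start][0], results[k - 1][0], k - start)
            if best is None or run[2] > best[2]:
                best = run
            start = k
    return best

# Illustrative (made-up) summaries of solutions on a grid of five gamma values.
grid = [(0.1, "a", 3), (0.2, "b", 2), (0.3, "b", 2), (0.4, "b", 2), (0.5, "c", 1)]
chosen = stable_gamma_range(grid)
```

Any $\gamma$ in the returned interval would then be a reasonable candidate, in the spirit of the stability argument above.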
Figure 4.1. Synthetic experiments: Plot showing probability of consistent estimation of the number of latent variables and the conditional graphical model structure of the observed variables. The three models studied are (a) a 36-node conditional graphical model given by a cycle with $h = 2$ latent variables, (b) a 36-node conditional graphical model given by a cycle with $h = 3$ latent variables, and (c) a 36-node conditional graphical model given by a $6 \times 6$ grid with $h = 1$ latent variable. For each plotted point, the probability of consistent estimation is obtained over 50 random trials.


■  4.5.1  Synthetic  data 

In  the  first  set  of  experiments  we  consider  a  setting  in  which  we  have  access  to  samples 
of  the  observed  variables  of  a  latent-variable  graphical  model.  We  consider  several 
latent-variable  Gaussian  graphical  models.  The  first  model  consists  of  p  =  36  observed 
variables  and  h  =  2  hidden  variables.  The  conditional  graphical  model  structure  of  the 
observed  variables  is  a  cycle  with  the  edge  partial  correlation  coefficients  equal  to  0.25; 
thus,  this  conditional  model  is  specified  by  a  sparse  graphical  model  with  degree  2.  The 
second  model  is  the  same  as  the  first  one,  but  with  h  =  3  latent  variables.  The  third 
model  consists  of  h  =  1  latent  variable,  and  the  conditional  graphical  model  structure 
of the observed variables is given by a 6 x 6 nearest-neighbor grid (i.e., p = 36 and
degree 4) with the partial correlation coefficients of the edges equal to 0.15. In all three
of these models each latent variable is connected to a random subset of 80% of the
observed variables (and the partial correlation coefficients corresponding to these edges
are also random). Therefore the effect of the latent variables is “spread out” over most
of  the  observed  variables,  i.e.,  the  low-rank  matrix  summarizing  the  effect  of  the  latent 
variables  is  incoherent. 
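The first synthetic model can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `cycle_latent_model`, the unit diagonal, the identity latent block, the uniform edge weights, and the scaling constant 0.4 are all hypothetical choices made so that the joint concentration matrix is positive definite; the text does not specify them.

```python
import numpy as np

def cycle_latent_model(p=36, h=2, rho=0.25, frac=0.8, seed=0):
    """Blocks of a joint concentration matrix: cycle on p observed nodes, h latents."""
    rng = np.random.default_rng(seed)
    # Observed block: unit diagonal, so the partial correlation of an edge
    # (i, j) is -K_O[i, j]; a cycle gives every node degree 2.
    K_O = np.eye(p)
    for i in range(p):
        j = (i + 1) % p
        K_O[i, j] = K_O[j, i] = -rho
    # Each latent variable touches a random subset of about 80% of the nodes.
    K_OH = np.zeros((p, h))
    for k in range(h):
        idx = rng.choice(p, size=int(round(frac * p)), replace=False)
        K_OH[idx, k] = rng.uniform(-1.0, 1.0, size=idx.size)
    # Hypothetical scaling (Frobenius norm 0.4) that keeps the joint
    # concentration matrix positive definite.
    K_OH *= 0.4 / np.linalg.norm(K_OH)
    K_H = np.eye(h)
    return K_O, K_OH, K_H

K_O, K_OH, K_H = cycle_latent_model()
L = K_OH @ np.linalg.solve(K_H, K_OH.T)   # low-rank effect of the latent variables
K_marginal = K_O - L                       # marginal concentration: sparse minus low-rank
Sigma_O = np.linalg.inv(K_marginal)        # marginal covariance of observed variables

# Draw n samples of the observed variables and form the sample covariance.
n = 500
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(K_O.shape[0]), Sigma_O, size=n)
Sigma_n = X.T @ X / n
```

The marginal concentration matrix `K_marginal` has many more non-zero entries than the cycle `K_O`, which is why sparse model selection applied directly to the observed variables is a poor fit here.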
Figure 4.2. Stock returns: The figure on the left shows the sparsity pattern (black denotes an edge,
and white denotes no edge) of the concentration matrix of the conditional graphical model (135 edges)
of the stock returns, conditioned on 5 latent variables, in a latent-variable graphical model (number of
parameters equals 639). This model is learned using (4.9), and the KL divergence with respect to a
Gaussian distribution specified by the sample covariance is 17.7. The figure on the right shows the
concentration matrix of the graphical model (646 edges) of the stock returns, learned using standard sparse
graphical model selection based on solving an $\ell_1$-regularized maximum-likelihood program (number of
parameters equals 730). The KL divergence between this distribution and a Gaussian distribution
specified by the sample covariance is 44.4.


For each model we generate n samples of the observed variables, and use the resulting
sample covariance matrix $\Sigma^n_O$ as input to our convex program (4.9). Figure 4.1 shows the
probability  of  recovery  of  the  support/sign-pattern  of  the  conditional  graphical  model 
structure  in  the  observed  variables  and  the  number  of  latent  variables  (i.e.,  probability 
of  obtaining  algebraically  consistent  estimates)  as  a  function  of  n.  This  probability  is 
evaluated  over  50  experiments  for  each  value  of  n. 
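A minimal sketch of how such a success probability might be tallied is given below. The helper names and the numerical threshold `tol` are assumptions introduced for illustration; the source does not state how near-zero entries or singular values were thresholded.

```python
import numpy as np

def is_algebraically_consistent(S_hat, L_hat, S_true, L_true, tol=1e-6):
    """True if the sign-pattern of S and the rank of L both match the true model."""
    # Entries with magnitude below tol are treated as zero (hypothetical threshold).
    sign_hat = np.sign(np.where(np.abs(S_hat) > tol, S_hat, 0.0))
    sign_true = np.sign(np.where(np.abs(S_true) > tol, S_true, 0.0))
    same_signs = np.array_equal(sign_hat, sign_true)
    same_rank = (np.linalg.matrix_rank(L_hat, tol=tol)
                 == np.linalg.matrix_rank(L_true, tol=tol))
    return same_signs and same_rank

def empirical_success_probability(trial, num_trials=50):
    """Fraction of trials for which trial(seed) returns True."""
    return sum(bool(trial(seed)) for seed in range(num_trials)) / num_trials
```

Here `trial(seed)` would generate samples, solve (4.9) with some solver, and return the result of `is_algebraically_consistent`; the solver call itself is deliberately left out of this sketch.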

In all of these cases standard graphical model selection applied directly to the
observed variables is not useful as the marginal concentration matrix of the observed
variables is not well-approximated by a sparse matrix. These experiments
agree with our theoretical results that the convex program (4.9) is an algebraically
consistent estimator of a latent-variable model given (sufficiently many) samples of only
the observed variables.

■  4.5.2  Stock  return  data 

In the next experiment we model the statistical structure of monthly stock returns of
84 companies in the S&P 100 index from 1990 to 2007; we disregard 16 companies that
were listed after 1990. The number of samples n is equal to 216. We compute the
sample covariance based on these returns and use this as input to (4.9).

The model learned using (4.9) for suitable values of $\lambda_n$, $\gamma$ consists of h = 5 latent
variables, and the conditional graphical model structure of the stock returns conditioned
on these hidden components consists of 135 edges. Therefore the number of parameters
in the model is 84 + 135 + (5 x 84) = 639. The resulting KL divergence between the
distribution specified by this model and a Gaussian distribution specified by the sample
covariance is 17.7. Figure 4.2 (left) shows the conditional graphical model structure.
The strongest edges in this conditional graphical model, as measured by partial
correlation, are between Baker Hughes - Schlumberger, A.T.&T. - Verizon, Merrill Lynch -
Morgan Stanley, Halliburton - Baker Hughes, Intel - Texas Instruments, Apple - Dell,
and Microsoft - Dell. It is of interest to note that in the Standard Industrial
Classification² system for grouping these companies, several of these pairs are in different
classes.

We compare these results to those obtained using a sparse graphical model learned
using $\ell_1$-regularized maximum-likelihood (see for example [119]), without introducing
any  latent  variables.  Figure  4.2  (right)  shows  this  graphical  model  structure.  The 
number  of  edges  in  this  model  is  646  (the  total  number  of  parameters  is  equal  to 
646  +  84  =  730),  and  the  resulting  KL  divergence  between  this  distribution  and  a 
Gaussian  distribution  specified  by  the  sample  covariance  is  44.4.  Indeed  to  obtain  a 
comparable  KL  divergence  to  that  of  the  latent-variable  model  described  above,  one 
would  require  a  graphical  model  with  over  3000  edges. 
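The two parameter counts quoted above follow from simple arithmetic, reproduced here as a check. The interpretation of each term in the comments (one term per observed variable, one per edge, one per latent-observed pair) is our reading of the sums in the text rather than something stated explicitly.

```python
p, h = 84, 5                       # observed stocks, latent variables
edges_latent_model = 135           # edges in the conditional graphical model
edges_sparse_model = 646           # edges in the sparse model without latents

# Latent-variable model: p per-variable terms, one term per conditional edge,
# and h * p terms linking latent to observed variables.
params_latent = p + edges_latent_model + h * p
# Sparse model without latent variables: p per-variable terms plus one per edge.
params_sparse = p + edges_sparse_model

print(params_latent, params_sparse)  # 639 730
```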

These  results  suggest  that  a  latent-variable  graphical  model  is  better  suited  than 
a  standard  sparse  graphical  model  for  modeling  the  statistical  structure  among  stock 
returns.  This  is  likely  due  to  the  presence  of  global,  long-range  correlations  in  stock 
return  data  that  are  better  modeled  via  latent  variables. 

■  4.6  Discussion 

We  have  studied  the  problem  of  modeling  the  statistical  structure  of  a  collection  of 
random  variables  as  a  sparse  graphical  model  conditioned  on  a  few  additional  hidden 
components.  As  a  first  contribution  we  described  conditions  under  which  such  latent- 
variable  graphical  models  are  identifiable  given  samples  of  only  the  observed  variables. 

² See the United States Securities and Exchange Commission website at
http://www.sec.gov/info/edgar/siccodes.htm

We also proposed a convex program based on regularized maximum-likelihood for
latent-variable graphical model selection; the regularization function is a combination of the $\ell_1$
norm and the nuclear norm. Given samples of the observed variables of a latent-variable
Gaussian model we proved that this convex program provides consistent estimates of
the number of hidden components as well as the conditional graphical model structure
among the observed variables conditioned on the hidden components. Our analysis
holds in the high-dimensional regime in which the numbers of observed and latent variables
are allowed to grow with the number of samples of the observed variables. In particular
we discuss certain scaling regimes in which consistent model selection is possible even
when the number of samples and the number of latent variables are on the same order
as the number of observed variables. These theoretical predictions are verified via a set
of experiments on synthetic data.

■  5.1  Introduction

Deducing  the  state  or  structure  of  a  system  from  partial,  noisy  measurements  is  a 
fundamental  task  throughout  the  sciences  and  engineering.  A  commonly  encountered 
difficulty  that  arises  in  such  inverse  problems  is  the  very  limited  availability  of  data 
relative to the ambient dimension of the signal to be estimated. However many
interesting signals or models in practice contain few degrees of freedom relative to their
ambient  dimension.  For  instance  a  small  number  of  genes  may  constitute  a  signature 
for  disease,  very  few  parameters  may  be  required  to  specify  the  correlation  structure  in 
a  time  series,  or  a  sparse  collection  of  geometric  constraints  might  completely  specify 
a  molecular  configuration.  Such  low-dimensional  structure  plays  an  important  role  in 
making  inverse  problems  well-posed.  In  this  chapter  we  propose  a  unified  approach  to 
transform  notions  of  simplicity  into  convex  penalty  functions,  thus  obtaining  convex 
optimization  formulations  for  inverse  problems. 

We  describe  a  model  as  simple  if  it  can  be  written  as  a  linear  combination  of  a  few 
elements from an atomic set. Concretely let $x \in \mathbb{R}^p$ be formed as follows:

$$x = \sum_{i=1}^{k} c_i a_i, \quad a_i \in \mathcal{A},\ c_i \geq 0, \tag{5.1}$$

where  A  is  a  set  of  atoms  that  constitute  simple  building  blocks  of  general  signals. 
Here  we  assume  that  x  is  simple  so  that  k  is  relatively  small.  For  example  A  could 
be  the  finite  set  of  unit-norm  one-sparse  vectors  in  which  case  x  is  a  sparse  vector, 
or  A  could  be  the  infinite  set  of  unit-norm  rank-one  matrices  in  which  case  x  is  a 
low-rank matrix. These two cases arise in many applications, and have received a
tremendous amount of attention recently as several authors have shown that sparse
vectors and low-rank matrices can be recovered from highly incomplete information
[29,30,53,54,121]. However a number of other structured mathematical objects also
fit  the  notion  of  simplicity  described  in  (5.1).  The  set  A  could  be  the  collection  of 
unit-norm  rank-one  tensors,  in  which  case  x  is  a  low-rank  tensor  and  we  are  faced 
with  the  familiar  challenge  of  low-rank  tensor  decomposition.  Such  problems  arise  in 
numerous  applications  in  computer  vision  and  image  processing  [1],  and  in  neuroscience 
[9].  Alternatively  A  could  be  the  set  of  permutation  matrices;  sums  of  a  few  permutation 
matrices  are  objects  of  interest  in  ranking  [84]  and  multi-object  tracking.  As  yet  another 
example,  A  could  consist  of  measures  supported  at  a  single  point  so  that  x  is  an  atomic 
measure  supported  at  just  a  few  points.  This  notion  of  simplicity  arises  in  problems  in 
system  identification  and  statistics. 
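To make the construction (5.1) concrete, the following sketch builds a sparse vector and a low-rank matrix as nonnegative combinations of k atoms. The dimensions and the random weights are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 20, 3

# Sparse vector: atoms are the signed standard basis vectors +/- e_i.
support = rng.choice(p, size=k, replace=False)
signs = rng.choice([-1.0, 1.0], size=k)
c = rng.uniform(0.5, 2.0, size=k)          # nonnegative weights c_i
x = np.zeros(p)
for ci, i, s in zip(c, support, signs):
    atom = np.zeros(p)
    atom[i] = s                             # unit-norm one-sparse atom
    x += ci * atom

# Low-rank matrix: atoms are rank-one matrices u v^T with unit Euclidean norm.
m = 8
X = np.zeros((m, m))
for ci in c:
    u = rng.standard_normal(m)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(m)
    v /= np.linalg.norm(v)
    X += ci * np.outer(u, v)                # ||u v^T||_F = ||u|| ||v|| = 1
```

The vector `x` has exactly k non-zero entries, and `X` has rank at most k (rank exactly k for generic random atoms).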

In each of these examples as well as several others, a fundamental problem of
interest is to recover x given limited linear measurements. For instance the question of
recovering a sparse function over the group of permutations (i.e., the sum of a few
permutation matrices) given linear measurements in the form of partial Fourier information
was investigated in the context of ranked election problems [84]. Similar linear inverse
problems arise with atomic measures in system identification, with orthogonal matrices
in machine learning, and with simple models formed from several other atomic sets (see
Section 5.2.2 for more examples). Hence we seek tractable computational tools to solve
such problems. When $\mathcal{A}$ is the collection of one-sparse vectors, a method of choice is to
use the $\ell_1$ norm to induce sparse solutions. This method, as mentioned previously, has
seen a surge of interest in the last few years as it provides a tractable convex optimization
formulation to exactly recover sparse vectors under various conditions [29,53,54]. Also
as discussed before, the nuclear norm has been proposed more recently as an
effective convex surrogate for solving rank minimization problems subject to various affine
constraints [30,121].

Motivated by the success of these methods we propose a general convex optimization
framework in Section 5.2 in order to recover objects with structure of the form (5.1)
from limited linear measurements. The guiding question behind our framework is: how
do we take a concept of simplicity such as sparsity and derive the $\ell_1$ norm as a convex
heuristic? In other words what is the natural procedure to go from the set of one-sparse
vectors $\mathcal{A}$ to the $\ell_1$ norm? We observe that the convex hull of (unit-Euclidean-norm)
one-sparse vectors is the unit ball of the $\ell_1$ norm, or the cross-polytope. Similarly the
convex hull of the (unit-Euclidean-norm) rank-one matrices is the nuclear norm ball;
see Chapter 2 for illustrations. These constructions suggest a natural generalization
to other settings. Under suitable conditions the convex hull $\mathrm{conv}(\mathcal{A})$ defines the unit
ball of a norm, which is called the atomic norm induced by the atomic set $\mathcal{A}$. We can
then minimize the atomic norm subject to measurement constraints, which results in a
convex programming heuristic for recovering simple models given linear measurements.
As an example suppose we wish to recover the sum of a few permutation matrices given
linear measurements. The convex hull of the set of permutation matrices is the Birkhoff
polytope of doubly stochastic matrices [149], and our proposal is to solve a convex
program that minimizes the norm induced by this polytope. Similarly if we wish to
recover an orthogonal matrix from linear measurements we would solve a spectral norm
minimization problem, as the spectral norm ball is the convex hull of all orthogonal
matrices. As discussed in Section 5.2.5 the atomic norm minimization problem is the
best convex heuristic for recovering simple models with respect to a given atomic set.

We give general conditions for exact and robust recovery using the atomic norm
heuristic. In Section 5.3 we provide concrete bounds on the number of generic linear
measurements required for the atomic norm heuristic to succeed. This analysis is based
on computing certain Gaussian widths of tangent cones with respect to the unit balls of
the atomic norm [76]. Arguments based on Gaussian width have been fruitfully applied
to obtain bounds on the number of Gaussian measurements for the special case of
recovering sparse vectors via $\ell_1$ norm minimization [127,134], but computing Gaussian widths
of general cones is not easy. Therefore it is important to exploit the special structure
in atomic norms, while still obtaining sufficiently general results that are broadly
applicable. An important theme in this chapter is the connection between Gaussian widths
and various notions of symmetry. Specifically by exploiting symmetry structure in
certain atomic norms as well as convex duality properties, we give bounds on the number
of measurements required for recovery using very general atomic norm heuristics. For
example we provide precise estimates of the number of generic measurements required
for exact recovery of an orthogonal matrix via spectral norm minimization, and the
number of generic measurements required for exact recovery of a permutation matrix
by minimizing the norm induced by the Birkhoff polytope. While these results
correspond to the recovery of individual atoms from random measurements, our techniques
are more generally applicable to the recovery of models formed as sums of a few atoms
as well. We also give tighter bounds than those previously obtained on the number of
measurements required to robustly recover sparse vectors and low-rank matrices via $\ell_1$
norm and nuclear norm minimization. In all of the cases we investigate, we find that
the number of measurements required to reconstruct an object is proportional to its
intrinsic dimension rather than the ambient dimension, thus confirming prior folklore.
See Table 5.1 for a summary of these results.

Underlying model                       Convex heuristic                     # Gaussian measurements
-------------------------------------  -----------------------------------  -----------------------
$s$-sparse vector in $\mathbb{R}^p$    $\ell_1$ norm                        $2s(\log(p/s - 1) + 1)$
$m \times m$ rank-$r$ matrix           nuclear norm                         $3r(2m - r)$
sign-vector $\{-1,+1\}^p$              $\ell_\infty$ norm                   $p/2$
$m \times m$ permutation matrix        norm induced by Birkhoff polytope    $9\,m \log(m)$
$m \times m$ orthogonal matrix         spectral norm                        $(3m^2 - m)/4$

Table 5.1. A summary of the recovery bounds obtained using Gaussian width arguments.
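As a rough numerical illustration of the intrinsic-versus-ambient comparison, the sketch below evaluates the bounds of Table 5.1 for the sparse, low-rank, sign-vector, and permutation cases. The function names and the problem sizes are hypothetical, and the logarithm is assumed to be natural.

```python
import math

def sparse_bound(p, s):
    """Bound for an s-sparse vector in R^p under l1 minimization."""
    return 2 * s * (math.log(p / s - 1) + 1)

def low_rank_bound(m, r):
    """Bound for an m x m rank-r matrix under nuclear norm minimization."""
    return 3 * r * (2 * m - r)

def sign_vector_bound(p):
    """Bound for a sign-vector in {-1,+1}^p under l-infinity minimization."""
    return p / 2

def permutation_bound(m):
    """Bound for an m x m permutation matrix (Birkhoff polytope norm)."""
    return 9 * m * math.log(m)

# Compare each bound with the ambient dimension of the unknown.
print("sparse     ", sparse_bound(1000, 10), "vs ambient", 1000)
print("low-rank   ", low_rank_bound(100, 2), "vs ambient", 100 * 100)
print("sign-vector", sign_vector_bound(1000), "vs ambient", 1000)
print("permutation", permutation_bound(100), "vs ambient", 100 * 100)
```

In each of these illustrative instances the bound is well below the ambient dimension.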

Although our conditions for recovery and bounds on the number of measurements
hold generally, we note that it may not be possible to obtain a computable
representation for the convex hull $\mathrm{conv}(\mathcal{A})$ of an arbitrary set of atoms $\mathcal{A}$. This leads us to another
important theme of this chapter, which we discuss in Section 5.4, on the connection
between algebraic structure in $\mathcal{A}$ and the semidefinite representability of the convex
hull $\mathrm{conv}(\mathcal{A})$. In particular when $\mathcal{A}$ is an algebraic variety the convex hull $\mathrm{conv}(\mathcal{A})$ can
be approximated as (the projection of) a set defined by linear matrix inequalities. Thus
the resulting atomic norm minimization heuristic can be solved via semidefinite
programming. A second issue that arises in practice is that even with algebraic structure
in $\mathcal{A}$ the semidefinite representation of $\mathrm{conv}(\mathcal{A})$ may not be computable in
polynomial time, which makes the atomic norm minimization problem intractable to solve. A
prominent example here is the tensor nuclear norm ball, which is obtained by taking
the convex hull of the rank-one tensors. In order to address this problem we study a
hierarchy of semidefinite relaxations using theta bodies [77] (described in Chapter 2),
which approximate the original (intractable) atomic norm minimization problem. A
third point we highlight is that while these semidefinite relaxations are more tractable
to solve, we require more measurements for exact recovery of the underlying model
than if we solve the original intractable atomic norm minimization problem. Hence we
have a tradeoff between the complexity of the recovery algorithm and the number of
measurements required for recovery. We illustrate this tradeoff with the cut polytope,
which is intractable to compute, and its relaxations.

Outline. Section 5.2 describes the construction of the atomic norm, gives several
examples of applications in which these norms may be useful to recover simple models, and
provides  general  conditions  for  recovery  by  minimizing  the  atomic  norm.  In  Section  5.3 
we  investigate  the  number  of  generic  measurements  for  exact  or  robust  recovery  using 
atomic  norm  minimization,  and  give  estimates  in  a  number  of  settings  by  analyzing 
the  Gaussian  width  of  certain  tangent  cones.  We  address  the  problem  of  semidefinite 
representability  and  tractable  relaxations  of  the  atomic  norm  in  Section  5.4.  Section  5.5 
describes  some  algorithmic  issues  as  well  as  a  few  simulation  results,  and  we  conclude 
with  a  discussion  in  Section  5.6. 

■  5.2  Atomic  Norms  and  Convex  Geometry 

In  this  section  we  describe  the  construction  of  an  atomic  norm  from  a  collection  of 
simple  atoms.  In  addition  we  give  several  examples  of  atomic  norms,  and  discuss  their 
properties  in  the  context  of  solving  ill-posed  linear  inverse  problems.  We  denote  the 
Euclidean  norm  by  ||  •  ||. 

■  5.2.1  Definition 

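Let $\mathcal{A}$ be a collection of atoms that is a compact subset of $\mathbb{R}^p$. We will assume
throughout this chapter that no element $a \in \mathcal{A}$ lies in the convex hull of the other elements
$\mathrm{conv}(\mathcal{A} \setminus a)$, i.e., the elements of $\mathcal{A}$ are the extreme points of $\mathrm{conv}(\mathcal{A})$. Let $\|x\|_{\mathcal{A}}$ denote
the gauge of $\mathcal{A}$ [124]:

$$\|x\|_{\mathcal{A}} = \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}. \tag{5.2}$$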

Note that the gauge is always a convex, extended-real valued function for any set $\mathcal{A}$. By
convention this function evaluates to $+\infty$ if $x$ does not lie in the affine hull of $\mathrm{conv}(\mathcal{A})$.
We will assume without loss of generality that the centroid of $\mathrm{conv}(\mathcal{A})$ is at the origin,
as this can be achieved by appropriate recentering. With this assumption the gauge
function can be rewritten as:

$$\|x\|_{\mathcal{A}} = \inf\Big\{ \sum_{a \in \mathcal{A}} c_a \;:\; x = \sum_{a \in \mathcal{A}} c_a a,\ c_a \geq 0\ \ \forall a \in \mathcal{A} \Big\},$$

with the sum being replaced by an integral when $\mathcal{A}$ is uncountable. If $\mathcal{A}$ is centrally
symmetric about the origin (i.e., $a \in \mathcal{A}$ if and only if $-a \in \mathcal{A}$) we have that $\|\cdot\|_{\mathcal{A}}$ is
a norm, which we call the atomic norm induced by $\mathcal{A}$. The support function of $\mathcal{A}$ is
given as:

$$\|x\|_{\mathcal{A}}^* = \sup\{\langle x, a \rangle : a \in \mathcal{A}\}. \tag{5.3}$$

If $\|\cdot\|_{\mathcal{A}}$ is a norm the support function $\|\cdot\|_{\mathcal{A}}^*$ is the dual norm of this atomic norm. From
this definition we see that the unit ball of $\|\cdot\|_{\mathcal{A}}$ is equal to $\mathrm{conv}(\mathcal{A})$. In many examples
of interest the set $\mathcal{A}$ is not centrally symmetric, so that the gauge function does not
define a norm. However our analysis is based on the underlying convex geometry of
$\mathrm{conv}(\mathcal{A})$, and our results are applicable even if $\|\cdot\|_{\mathcal{A}}$ does not define a norm. Therefore,
with an abuse of terminology we generally refer to $\|\cdot\|_{\mathcal{A}}$ as the atomic norm of the
set $\mathcal{A}$ even if $\|\cdot\|_{\mathcal{A}}$ is not a norm. We note that the duality characterization between
(5.2) and (5.3) when $\|\cdot\|_{\mathcal{A}}$ is a norm is in fact applicable even in infinite-dimensional
Banach spaces by Bonsall’s atomic decomposition theorem [21], but our focus is on
the finite-dimensional case in this work. We investigate in greater detail the issues of
representability and efficient approximation of these atomic norms in Section 5.4.
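For a finite atomic set the gauge and the support function can be checked by hand. The sketch below, which is only an illustration for the one-sparse atoms $\{\pm e_i\}$ with an arbitrary test vector, exhibits a feasible decomposition (an upper bound on the gauge) and a dual vector $z$ with support function at most 1 (giving the lower bound $\langle x, z \rangle$); the two bounds coincide at the $\ell_1$ norm.

```python
import numpy as np

p = 5
x = np.array([3.0, 0.0, -1.5, 0.0, 0.5])

# Atomic set: the 2p signed basis vectors, stored as rows.
atoms = np.vstack([np.eye(p), -np.eye(p)])

# Feasible decomposition: put weight |x_i| on the atom sign(x_i) e_i.
c = np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])
assert np.allclose(c @ atoms, x) and np.all(c >= 0)
upper = c.sum()                       # upper bound on the gauge of x

def support_function(z, atoms):
    """Support function sup{<z, a> : a in A} for a finite atomic set."""
    return np.max(atoms @ z)

# Any z whose support function is at most 1 gives the lower bound <x, z>,
# since <x, z> = sum_a c_a <a, z> <= sum_a c_a for every decomposition of x.
z = np.sign(x)
assert support_function(z, atoms) <= 1.0
lower = x @ z

print(upper, lower, np.abs(x).sum())
```

For this atomic set the support function is the $\ell_\infty$ norm, consistent with its role as the dual of the $\ell_1$ norm.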

Equipped with a convex penalty function given a set of atoms, we propose a convex
optimization method to recover a “simple” model given limited linear measurements.
Specifically suppose that $x^\star$ is formed according to (5.1) from a set of atoms $\mathcal{A}$. Further
suppose that we have a known linear map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$, and we have linear information
about $x^\star$ as follows:

$$y = \Phi x^\star. \tag{5.4}$$


The goal is to reconstruct $x^\star$ given $y$. We consider the following convex formulation to
accomplish this task:

$$\hat{x} = \arg\min_{x} \|x\|_{\mathcal{A}} \quad \text{s.t.} \quad y = \Phi x. \tag{5.5}$$


When $\mathcal{A}$ is the set of one-sparse atoms this problem reduces to standard $\ell_1$ norm
minimization. Similarly when $\mathcal{A}$ is the set of rank-one matrices this problem reduces
to nuclear norm minimization. More generally if the atomic norm $\|\cdot\|_{\mathcal{A}}$ is tractable to
evaluate, then (5.5) potentially offers an efficient convex programming formulation for
reconstructing $x^\star$ from the limited information $y$. The dual problem of (5.5) is given
as follows:

$$\max_{z} \; y^T z \quad \text{s.t.} \quad \|\Phi^\dagger z\|_{\mathcal{A}}^* \leq 1. \tag{5.6}$$

Here $\Phi^\dagger$ denotes the adjoint (or transpose) of the linear measurement map $\Phi$.

The convex formulation (5.5) can be suitably modified in case we only have access
to inaccurate, noisy information. Specifically suppose that we have noisy measurements
$y = \Phi x^\star + \omega$ where $\omega$ represents the noise term. A natural convex formulation is one in
which the constraint $y = \Phi x$ of (5.5) is replaced by the relaxed constraint $\|y - \Phi x\| \leq \delta$,
where $\delta$ is an upper bound on the size of the noise $\omega$:

$$\hat{x} = \arg\min_{x} \|x\|_{\mathcal{A}} \quad \text{s.t.} \quad \|y - \Phi x\| \leq \delta. \tag{5.7}$$


We say that we have exact recovery in the noise-free case if $\hat{x} = x^\star$ in (5.5), and robust
recovery in the noisy case if the error $\|\hat{x} - x^\star\|$ is small in (5.7). In Section 5.2.4 and
Section 5.3 we give conditions under which the atomic norm heuristics (5.5) and (5.7)
recover $x^\star$ exactly or approximately. Atomic norms have found fruitful applications in
problems in approximation theory of various function classes [8,46,86,116]. However
this prior body of work was concerned with infinite-dimensional Banach spaces, and
none of these references considers or provides recovery guarantees that are applicable in
our setting.
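For the one-sparse atomic set, the noise-free program (5.5) is an $\ell_1$ minimization and can be written as a linear program over the weights on the atoms $+e_i$ and $-e_i$. The sketch below assumes SciPy is available and uses its `linprog` routine; the function name `l1_minimize` and the problem sizes are illustrative choices, not taken from the text.

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(Phi, y):
    """Solve min ||x||_1 s.t. Phi x = y as a linear program."""
    n, p = Phi.shape
    # Split x = x_pos - x_neg with both parts nonnegative; these are the
    # weights on the atoms +e_i and -e_i, so the objective is their sum.
    cost = np.ones(2 * p)
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(0)
p, s, n = 40, 3, 30                   # illustrative sizes (assumed)
x_star = np.zeros(p)
x_star[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
Phi = rng.standard_normal((n, p))     # generic (Gaussian) measurement map
x_hat = l1_minimize(Phi, Phi @ x_star)

print("recovery error:", np.max(np.abs(x_hat - x_star)))
```

With n well above the number of measurements needed for this sparsity level, the recovered `x_hat` is expected to match `x_star` up to solver tolerance, which is the exact-recovery event defined above.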


■  5.2.2  Examples 

Next  we  provide  several  examples  of  atomic  norms  that  can  be  viewed  as  special  cases 
of  the  construction  above.  These  norms  are  obtained  by  convexifying  atomic  sets  that 
are  of  interest  in  various  applications. 

Sparse vectors. The problem of recovering sparse vectors from limited
measurements has received a great deal of attention, with applications in many problem
domains. In this case the atomic set $\mathcal{A} \subset \mathbb{R}^p$ can be viewed as the set of unit-norm
one-sparse vectors $\{\pm e_i\}_{i=1}^p$, and $k$-sparse vectors in $\mathbb{R}^p$ can be constructed using a
linear combination of $k$ elements of the atomic set. In this case it is easily seen that the
convex hull $\mathrm{conv}(\mathcal{A})$ is given by the cross-polytope (i.e., the unit ball of the $\ell_1$ norm;
see Chapter 2), and the atomic norm $\|\cdot\|_{\mathcal{A}}$ corresponds to the $\ell_1$ norm in $\mathbb{R}^p$.

Low-rank  matrices.  Recovering  low-rank  matrices  from  limited  information  is  also 
a  problem  that  has  received  considerable  attention  as  it  finds  applications  in  problems 
in  statistics,  control,  and  machine  learning.  The  atomic  set  A  here  can  be  viewed  as 
the set of rank-one matrices of unit-Euclidean-norm. The convex hull $\mathrm{conv}(\mathcal{A})$ is the
nuclear  norm  ball  of  matrices  in  which  the  sum  of  the  singular  values  is  less  than  or 
equal  to  one  (see  Chapter  2). 
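The singular value decomposition gives an explicit atomic decomposition in this case. The following sketch, with arbitrary illustrative dimensions, writes a rank-r matrix as a nonnegative combination of unit-norm rank-one atoms whose weights sum to the nuclear norm.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 6, 2
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, m))   # rank-r matrix

U, sigma, Vt = np.linalg.svd(X)
# Each u_i v_i^T is a rank-one atom with unit Euclidean (Frobenius) norm.
atoms = [np.outer(U[:, i], Vt[i, :]) for i in range(r)]
X_rebuilt = sum(s * a for s, a in zip(sigma[:r], atoms))

nuclear = sigma.sum()                 # atomic norm: sum of singular values
spectral = sigma.max()                # dual norm: largest singular value

print(nuclear, spectral)
```

Dividing `X` by `nuclear` yields a matrix on the boundary of the nuclear norm ball, i.e. a convex combination of the atoms.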



CHAPTER  5.  CONVEX  GEOMETRY  OF  LINEAR  INVERSE  PROBLEMS 


Permutation matrices. A problem of interest in a ranking context [84] or an
object tracking context is that of recovering permutation matrices from partial
information. Suppose that a small number $k$ of rankings of $m$ candidates is preferred by a
population. Such preferences can be modeled as the sum of a few $m \times m$ permutation
matrices, with each permutation corresponding to a particular ranking. By conducting
surveys of the population one can obtain partial linear information of these preferred
rankings. The set $\mathcal{A}$ here is the collection of permutation matrices (consisting of $m!$
elements), and the convex hull $\mathrm{conv}(\mathcal{A})$ is the Birkhoff polytope or the set of doubly
stochastic matrices [149]. The centroid of the Birkhoff polytope is the matrix $\mathbf{1}\mathbf{1}^T/m$,
so it needs to be recentered appropriately. We mention here recent work by Jagabathula
and  Shah  [84]  on  recovering  a  sparse  function  over  the  symmetric  group  (i.e.,  the  sum  of 
a  few  permutation  matrices)  given  partial  Fourier  information;  although  the  algorithm 
proposed  in  [84]  is  tractable  it  is  not  based  on  convex  optimization. 
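For small m these facts about the Birkhoff polytope can be verified by brute-force enumeration. The sketch below uses m = 4 and arbitrary convex weights, both chosen only for illustration.

```python
import itertools
import numpy as np

m = 4
# All m! permutation matrices, obtained by permuting the rows of the identity.
perms = [np.eye(m)[list(sigma)] for sigma in itertools.permutations(range(m))]
assert len(perms) == 24

# The average of all atoms is the centroid of the Birkhoff polytope.
centroid = sum(perms) / len(perms)

# A convex combination of a few permutation matrices is doubly stochastic.
weights = np.array([0.5, 0.3, 0.2])
D = sum(w * P for w, P in zip(weights, perms[:3]))

print(centroid)
print(D.sum(axis=0), D.sum(axis=1))
```

Subtracting `centroid` from every atom gives the recentered atomic set whose convex hull has its centroid at the origin.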

Binary vectors. In integer programming one is often interested in recovering
vectors in which the entries take on values of ±1. Suppose that there exists such a
sign-vector, and we wish to recover this vector given linear measurements. This corresponds
to a version of the multi-knapsack problem [102]. In this case $\mathcal{A}$ is the set of all
sign-vectors, and the convex hull $\mathrm{conv}(\mathcal{A})$ is the hypercube or the unit ball of the $\ell_\infty$ norm.
The image of this hypercube under a linear map is also referred to as a zonotope [149].

Vectors from lists. Suppose there is an unknown vector $x \in \mathbb{R}^p$, and that we
are given the entries of this vector without any information about the locations of
these entries. For example if x = [3 1 2 2 4]', then we are only given the list of
numbers {1, 2, 2, 3, 4} without their positions in x. Further suppose that we have access
to a few linear measurements of x. Can we recover x by solving a convex program?
Such a problem is of interest in recovering partial rankings of elements of a set. An
extreme case is one in which we only have two preferences for rankings, i.e., a vector
in $\{1,2\}^p$ composed only of ones and twos, which reduces to a special case of the
problem above of recovering binary vectors (in which the number of entries of each
sign is fixed). For this problem the set $\mathcal{A}$ is the set of all permutations of x (which we
know since we have the list of numbers that compose x), and the convex hull $\mathrm{conv}(\mathcal{A})$
is the permutahedron [129,149] (see Chapter 2). As with the Birkhoff polytope, the
permutahedron also needs to be recentered about the point $(\mathbf{1}^T x / p)\,\mathbf{1}$.
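The recentering point can be checked directly on the example list above. The sketch enumerates the distinct rearrangements of x and averages them; the variable names are illustrative.

```python
import itertools
import numpy as np

x = np.array([3.0, 1.0, 2.0, 2.0, 4.0])
p = x.size

# Atoms: all distinct rearrangements of the entries of x.
atoms = np.array(sorted(set(itertools.permutations(x))))

# Averaging the atoms gives a constant vector with every entry equal to mean(x).
centroid = atoms.mean(axis=0)
recentered = atoms - centroid

print(atoms.shape, centroid)
```

Every recentered atom has entries summing to zero, so the recentered permutahedron lies in the hyperplane orthogonal to the all-ones vector.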

Matrices constrained by eigenvalues. This problem is in a sense the
non-commutative analog of the one above. Suppose that we are given the eigenvalues $\lambda$ of
a symmetric matrix, but no information about the eigenvectors. Can we recover such
a matrix given some additional linear measurements? In this case the set $\mathcal{A}$ is the set
of all symmetric matrices with eigenvalues $\lambda$, and the convex hull $\mathrm{conv}(\mathcal{A})$ is given by
the Schur-Horn orbitope [129] (see Chapter 2).

Orthogonal  matrices.  In  many  applications  matrix  variables  are  constrained  to  be 
orthogonal,  which  is  a  non-convex  constraint  and  may  lead  to  computational  difficulties. 
We  consider  one  such  simple  setting  in  which  we  wish  to  recover  an  orthogonal  matrix 
given  limited  information  in  the  form  of  linear  measurements.  In  this  example  the  set 
A is the set of m × m orthogonal matrices, and conv(A) is the spectral norm ball.
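The claim that convex combinations of orthogonal matrices stay inside the spectral norm ball can be checked numerically in the 2 × 2 case. The sketch below is illustrative only: the helper names are hypothetical, and the spectral norm is computed from the closed-form eigenvalues of $A^T A$ for a 2 × 2 matrix.

```python
import math

def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def spectral_norm_2x2(a):
    """Largest singular value of a 2x2 matrix.

    The eigenvalues of A^T A have sum t = ||A||_F^2 and product det(A)^2,
    so the largest one is (t + sqrt(t^2 - 4 det^2)) / 2.
    """
    t = sum(a[i][j] ** 2 for i in range(2) for j in range(2))
    d = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    disc = max(t * t - 4.0 * d * d, 0.0)
    return math.sqrt((t + math.sqrt(disc)) / 2.0)

def mix(a, b, w):
    """Convex combination w*a + (1-w)*b of two 2x2 matrices."""
    return [[w * a[i][j] + (1.0 - w) * b[i][j] for j in range(2)] for i in range(2)]

atom = rotation(0.7)
blend = mix(rotation(0.3), rotation(2.0), 0.5)
print(spectral_norm_2x2(atom))    # an atom sits on the boundary of the ball
print(spectral_norm_2x2(blend))   # a mixture of atoms lies strictly inside
```

Here the mixture equals cos(0.85) times a rotation, so its spectral norm is |cos(0.85)|.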

Measures.  Recovering  a  measure  given  its  moments  is  another  question  of  interest 
that  arises  in  system  identification  and  statistics.  Suppose  one  is  given  access  to  a  linear 
combination  of  moments  of  an  atomically  supported  measure.  How  can  we  reconstruct 
the  support  of  the  measure?  The  set  A  here  is  the  moment  curve,  and  its  convex  hull 
conv(A)  goes  by  several  names  including  the  Caratheodory  orbitope  [129].  Discretized 
versions  of  this  problem  correspond  to  the  set  A  being  a  finite  number  of  points  on  the 
moment curve; the convex hull conv(A) is then a cyclic polytope [149].

Cut  matrices.  In  some  problems  one  may  wish  to  recover  low-rank  matrices  in 
which  the  entries  are  constrained  to  take  on  values  of  ±1.  Such  matrices  can  be  used 
to  model  basic  user  preferences,  and  are  of  interest  in  problems  such  as  collaborative 
filtering  [133].  The  set  of  atoms  A  could  be  the  set  of  rank-one  signed  matrices,  i.e., 
matrices of the form $zz^T$ with the entries of z being ±1. The convex hull conv(A) of
such matrices is the cut polytope [47]. An interesting issue that arises here is that the
cut polytope is in general intractable to characterize. However there exist several
well-known tractable semidefinite relaxations to this polytope [47,72], and one can employ
these  in  constructing  efficient  convex  programs  for  recovering  cut  matrices.  We  discuss 
this  point  in  greater  detail  in  Section  5.4.3. 
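To make the atomic set for cut matrices concrete, the sketch below enumerates the rank-one signed matrices $zz^T$ for a small dimension. The function name is illustrative. The closing comment mentions the unit-diagonal positive-semidefinite relaxation, which is one common choice and is assumed here; it is not necessarily the relaxation developed in Section 5.4.3.

```python
from itertools import product

def cut_atoms(p):
    """All distinct rank-one signed matrices z z^T with z in {-1, +1}^p."""
    seen = set()
    for z in product((-1, 1), repeat=p):
        m = tuple(tuple(zi * zj for zj in z) for zi in z)
        seen.add(m)
    return sorted(seen)

atoms3 = cut_atoms(3)
print(len(atoms3))  # z and -z give the same matrix, so there are 2**(p-1) = 4 atoms

# Every atom is symmetric with unit diagonal. A common semidefinite relaxation
# (assumed here for illustration) keeps these linear constraints and replaces
# the rank-one requirement by positive semidefiniteness.
for m in atoms3:
    assert all(m[i][i] == 1 for i in range(3))
    assert all(m[i][j] == m[j][i] for i in range(3) for j in range(3))
```

The number of atoms grows as $2^{p-1}$, which gives a sense of why the exact convex hull is hard to describe.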

Low-rank  tensors.  Low-rank  tensor  decompositions  play  an  important  role  in 
numerous applications throughout signal processing and machine learning [91]. Developing
computational tools to recover low-rank tensors is therefore of great interest.
In principle we could solve a tensor nuclear norm minimization problem, in which the
tensor nuclear norm ball is obtained by taking the convex hull of rank-one tensors. A
computational challenge here is that the tensor nuclear norm is in general intractable
to compute; in order to address this problem we discuss further convex relaxations to
the tensor nuclear norm using theta bodies in Section 5.4. A number of additional
technical issues also arise with low-rank tensors including the non-existence in general
of a singular value decomposition analogous to that for matrices [90], and the difference
between the rank of a tensor and its border rank [45].

Nonorthogonal factor analysis. Suppose that a data matrix admits a factorization
X = AB. The matrix nuclear norm heuristic will find a factorization into
orthogonal factors in which the columns of A and rows of B are mutually orthogonal.
However if a priori information is available about the factors, precision and recall could
be improved by enforcing such priors. These priors may sacrifice orthogonality, but the
factors might better conform with assumptions about how the data are generated. For
instance in some applications one might know in advance that the factors should only
take on a discrete set of values [133]. In this case, we might try to fit a sum of rank-one
matrices that are bounded in $\ell_\infty$ norm rather than in $\ell_2$ norm. Another prior that
commonly arises in practice is that the factors are non-negative (i.e., in non-negative
matrix  factorization).  These  and  other  priors  on  the  basic  rank-one  summands  induce 
different  norms  on  low-rank  models  than  the  standard  nuclear  norm  [64],  and  may  be 
better  suited  to  specific  applications. 

■  5.2.3  Background  on  tangent  and  normal  cones 

In order to properly state our results, we recall some basic concepts from convex analysis.
A convex set C is a cone if it is closed under positive linear combinations. The
polar $C^*$ of a cone C is the cone

$$C^* = \{x \in \mathbb{R}^p : \langle x, z \rangle \le 0 \ \ \forall z \in C\}.$$

Given some nonzero $x \in \mathbb{R}^p$ we define the tangent cone at x with respect to the scaled
unit ball $\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ as

$$T_{\mathcal{A}}(x) = \mathrm{cone}\{z - x : \|z\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}\}. \tag{5.8}$$

The cone $T_{\mathcal{A}}(x)$ is equal to the set of descent directions of the atomic norm $\|\cdot\|_{\mathcal{A}}$ at the
point x, i.e., the set of all directions d such that the directional derivative is negative.
This notation is slightly overloaded relative to the notation in Chapter 2.

The normal cone $N_{\mathcal{A}}(x)$ at x with respect to the scaled unit ball $\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ is
defined to be the set of all directions s that form obtuse angles with every descent
direction of the atomic norm $\|\cdot\|_{\mathcal{A}}$ at the point x:

$$N_{\mathcal{A}}(x) = \{s : \langle s, z - x \rangle \le 0 \ \ \forall z \ \text{s.t.}\ \|z\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}\}. \tag{5.9}$$

The normal cone is equal to the set of all hyperplanes given by normal vectors s that
support the scaled unit ball $\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ at x. Observe that the polar cone of the
tangent cone $T_{\mathcal{A}}(x)$ is the normal cone $N_{\mathcal{A}}(x)$ and vice-versa. Moreover we have the
following basic characterization

$$N_{\mathcal{A}}(x) = \mathrm{cone}(\partial \|x\|_{\mathcal{A}}),$$

which states that the normal cone $N_{\mathcal{A}}(x)$ is the conic hull of the subdifferential of the
atomic norm at x.
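As a concrete instance of these definitions, consider the following worked example, added for illustration. It assumes the atomic set $\{\pm e_1, \pm e_2\}$ in $\mathbb{R}^2$, for which the atomic norm is the $\ell_1$ norm, and takes $x = e_1$:

```latex
\begin{align*}
T_{\mathcal{A}}(e_1) &= \mathrm{cone}\{z - e_1 : |z_1| + |z_2| \le 1\}
                      = \{d \in \mathbb{R}^2 : d_1 + |d_2| \le 0\}, \\
\partial \|e_1\|_{1} &= \{(1, t) : |t| \le 1\}, \\
N_{\mathcal{A}}(e_1) &= \mathrm{cone}\big(\partial \|e_1\|_{1}\big)
                      = \{s \in \mathbb{R}^2 : |s_2| \le s_1\}.
\end{align*}
```

One can check directly that $\langle s, d \rangle \le s_1 d_1 + |s_2||d_2| \le -s_1|d_2| + |s_2||d_2| \le 0$ for every s in the normal cone and d in the tangent cone, consistent with the two cones being polar to each other.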

■  5.2.4  Recovery  condition 

The following result gives a characterization of the favorable underlying geometry required
for exact recovery. Let $\mathrm{null}(\Phi)$ denote the nullspace of the operator $\Phi$.

Proposition 5.2.1. We have that $\hat{x} = x^*$ is the unique optimal solution of (5.5) if
and only if $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\}$.

Proof. Eliminating the equality constraints in (5.5) we have the equivalent optimization
problem

$$\min_{d} \ \|x^* + d\|_{\mathcal{A}} \quad \text{s.t.} \quad d \in \mathrm{null}(\Phi).$$

Suppose $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\}$. Since $\|x^* + d\|_{\mathcal{A}} \le \|x^*\|_{\mathcal{A}}$ implies $d \in T_{\mathcal{A}}(x^*)$, we have
that $\|x^* + d\|_{\mathcal{A}} > \|x^*\|_{\mathcal{A}}$ for all $d \in \mathrm{null}(\Phi) \setminus \{0\}$. Conversely $x^*$ is the unique optimal
solution of (5.5) if and only if $\|x^* + d\|_{\mathcal{A}} > \|x^*\|_{\mathcal{A}}$ for all $d \in \mathrm{null}(\Phi) \setminus \{0\}$, which implies that
$d \notin T_{\mathcal{A}}(x^*)$. □

Proposition  5.2.1  asserts  that  the  atomic  norm  heuristic  succeeds  if  the  nullspace  of 
the sampling operator does not intersect the tangent cone $T_{\mathcal{A}}(x^*)$ at $x^*$. In Section 5.3
we  provide  a  characterization  of  tangent  cones  that  determines  the  number  of  Gaussian 
measurements  required  to  guarantee  such  an  empty  intersection. 
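The condition of Proposition 5.2.1 can be checked by brute force in a toy setting. The sketch below is an illustration under stated assumptions: the $\ell_1$ norm in $\mathbb{R}^2$, the point $x^* = (1, 0)$, and a single measurement row (a, b). The function name and grid parameters are hypothetical. It tests whether moving along the one-dimensional nullspace ever fails to increase the norm.

```python
def l1(v):
    """The l1 norm of a 2-vector."""
    return abs(v[0]) + abs(v[1])

def unique_l1_minimizer(phi, x_star, steps=2000, radius=2.0):
    """Grid check of the nullspace condition for one measurement in R^2.

    phi is a single row (a, b); its nullspace is spanned by n = (b, -a).
    Returns True if ||x_star + t*n||_1 > ||x_star||_1 for every sampled t != 0.
    """
    a, b = phi
    n = (b, -a)
    base = l1(x_star)
    for k in range(-steps, steps + 1):
        if k == 0:
            continue
        t = radius * k / steps
        candidate = (x_star[0] + t * n[0], x_star[1] + t * n[1])
        if l1(candidate) <= base + 1e-12:
            return False
    return True

x_star = (1.0, 0.0)
print(unique_l1_minimizer((2.0, 1.0), x_star))  # True: the nullspace misses the tangent cone
print(unique_l1_minimizer((1.0, 2.0), x_star))  # False: a nullspace direction is a descent direction
```

In this toy case the nullspace direction (b, -a) is a descent direction exactly when |a| ≤ |b|, so recovery succeeds when the measurement weights the nonzero coordinate more heavily.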

A tightening of this empty intersection condition can also be used to address the
noisy approximation problem. The following proposition characterizes when $x^*$ can be
well-approximated using the convex program (5.7).

Proposition 5.2.2. Suppose that we are given n noisy measurements $y = \Phi x^* + \omega$
where $\|\omega\| \le \delta$, and $\Phi : \mathbb{R}^p \to \mathbb{R}^n$. Let $\hat{x}$ denote an optimal solution of (5.7). Further
suppose for all $z \in T_{\mathcal{A}}(x^*)$ that we have $\|\Phi z\| \ge \epsilon \|z\|$. Then $\|\hat{x} - x^*\| \le \frac{2\delta}{\epsilon}$.

Proof. The set of descent directions at $x^*$ with respect to the atomic norm ball is given
by the tangent cone $T_{\mathcal{A}}(x^*)$. The error vector $\hat{x} - x^*$ lies in $T_{\mathcal{A}}(x^*)$ because $\hat{x}$ is a
minimal atomic norm solution, and hence $\|\hat{x}\|_{\mathcal{A}} \le \|x^*\|_{\mathcal{A}}$. It follows by the triangle
inequality that

$$\|\Phi(\hat{x} - x^*)\| \le \|\Phi \hat{x} - y\| + \|\Phi x^* - y\| \le 2\delta. \tag{5.10}$$

By assumption we have that

$$\|\Phi(\hat{x} - x^*)\| \ge \epsilon \|\hat{x} - x^*\|, \tag{5.11}$$

which allows us to conclude that $\|\hat{x} - x^*\| \le \frac{2\delta}{\epsilon}$. □

Therefore, we need only concern ourselves with estimating the minimum value of
$\|\Phi z\| / \|z\|$ for non-zero $z \in T_{\mathcal{A}}(x^*)$. We denote this quantity as the minimum gain of the
measurement operator restricted to the cone $T_{\mathcal{A}}(x^*)$. In particular if this minimum
gain  is  bounded  away  from  zero,  then  the  atomic  norm  heuristic  also  provides  robust 
recovery  when  we  have  access  to  noisy  linear  measurements  of  x*. 

■  5.2.5  Why  atomic  norm? 

The  atomic  norm  induced  by  a  set  A  possesses  a  number  of  favorable  properties  that 
are  useful  for  recovering  “simple”  models  from  limited  linear  measurements.  The  key 
point to note from Section 5.2.4 is that the smaller the tangent cone at a point $x^*$
with respect to conv(A), the easier it is to satisfy the empty-intersection condition of
Proposition 5.2.1.

Based on this observation it is desirable that points in conv(A) with smaller tangent
cones correspond to simpler models, while points in conv(A) with larger tangent cones
generally correspond to more complicated models. The construction of conv(A) by
taking the convex hull of A ensures that this is the case. The extreme points of conv(A)
correspond to the simplest models, i.e., those models formed from a single element of A.
Further the low-dimensional faces of conv(A) consist of those elements that are obtained
by taking linear combinations of a few basic atoms from A. These are precisely the
properties desired as points lying in these low-dimensional faces of conv(A) have smaller
tangent cones than those lying on larger faces.

We also note that the atomic norm is in some sense the best possible convex heuristic
for recovering simple models. Specifically the unit ball of any convex penalty heuristic
must satisfy a key property: the tangent cone at any $a \in \mathcal{A}$ with respect to this unit ball
must contain the vectors $a' - a$ for all $a' \in \mathcal{A}$. The best convex penalty function is one
in which the tangent cones at $a \in \mathcal{A}$ to the unit ball are the smallest possible, while still
satisfying this requirement. This is because, as described above, smaller tangent cones
are more likely to satisfy the empty intersection condition required for exact recovery.
It is clear that the smallest such convex set is precisely conv(A), hence implying that
the atomic norm is the best convex heuristic for recovering simple models.

Our reasons for proposing the atomic norm as a useful convex heuristic are quite
different from previous justifications of the $\ell_1$ norm and the nuclear norm. In particular
let $f : \mathbb{R}^p \to \mathbb{R}$ denote the cardinality function that counts the number of nonzero
entries of a vector. Then the $\ell_1$ norm is the convex envelope of f restricted to the
unit ball of the $\ell_\infty$ norm, i.e., the best convex underestimator of f restricted to vectors
in the $\ell_\infty$-norm ball. This view of the $\ell_1$ norm in relation to the function f is often
given as a justification for its effectiveness in recovering sparse vectors. However if
we consider the convex envelope of f restricted to the Euclidean norm ball, then we
obtain a very different convex function than the $\ell_1$ norm! With more general atomic
sets, it may not be clear a priori what the bounding set should be in deriving the
convex envelope. In contrast the viewpoint adopted in this chapter leads to a natural,
unambiguous construction of the $\ell_1$ norm and other general atomic norms. Further
as  explained  above  it  is  the  favorable  facial  structure  of  the  atomic  norm  ball  that 
makes  the  atomic  norm  a  suitable  convex  heuristic  to  recover  simple  models,  and  this 
connection  is  transparent  in  the  definition  of  the  atomic  norm. 

■  5.3  Recovery  from  Generic  Measurements 

We consider the question of using the convex program (5.5) to recover “simple” models
formed according to (5.1) from a generic measurement operator or map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$.
Specifically, we wish to compute estimates on the number of measurements n so that
we have exact recovery using (5.5) for most operators consisting of n measurements.
That is, the measure of n-measurement operators for which recovery fails using (5.5)
must be exponentially small. In order to conduct such an analysis we study random
Gaussian maps $\Phi$, in which the entries are independent and identically distributed
Gaussians. These measurement operators have the property that the nullspace $\mathrm{null}(\Phi)$
is uniformly distributed among the set of all (p − n)-dimensional subspaces in $\mathbb{R}^p$. In
particular we analyze when such operators satisfy the conditions of Proposition 5.2.1
and Proposition 5.2.2 for exact recovery.

■  5.3.1  Recovery  conditions  based  on  Gaussian  width 

Proposition 5.2.1 requires that the nullspace of the measurement operator $\Phi$ must miss
the tangent cone $T_{\mathcal{A}}(x^*)$. Gordon [76] gave a solution to the problem of characterizing
the probability that a random subspace (of some fixed dimension) distributed uniformly
misses a cone. We begin by defining the Gaussian width of a set, which plays a key role
in Gordon’s analysis.

Definition 5.3.1. The Gaussian width of a set $S \subset \mathbb{R}^p$ is defined as:

$$w(S) := \mathbb{E}_g \left[ \sup_{z \in S} g^T z \right],$$

where $g \sim \mathcal{N}(0, I)$ is a vector of independent zero-mean unit-variance Gaussians.

Gordon characterized the likelihood that a random subspace misses a cone C purely
in terms of the dimension of the subspace and the Gaussian width $w(C \cap \mathbb{S}^{p-1})$, where
$\mathbb{S}^{p-1} \subset \mathbb{R}^p$ is the unit sphere. Before describing Gordon’s result formally, we introduce
some notation. Let $\lambda_k$ denote the expected length of a k-dimensional Gaussian random
vector. By elementary integration, we have that $\lambda_k = \sqrt{2}\,\Gamma(\frac{k+1}{2}) / \Gamma(\frac{k}{2})$. Further by
induction one can show that $\lambda_k$ is tightly bounded as $\frac{k}{\sqrt{k+1}} \le \lambda_k \le \sqrt{k}$.
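The closed form for the expected length of a k-dimensional standard Gaussian vector is easy to evaluate numerically. The following sketch (the function name is illustrative) computes it through the log-gamma function to avoid overflow and compares it against the upper bound $\sqrt{k}$ and the standard lower bound $k/\sqrt{k+1}$.

```python
import math

def expected_gaussian_length(k):
    """lambda_k = sqrt(2) * Gamma((k+1)/2) / Gamma(k/2), via log-gamma for stability."""
    return math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2.0) - math.lgamma(k / 2.0))

for k in (1, 2, 10, 100, 1000):
    lam = expected_gaussian_length(k)
    lower, upper = k / math.sqrt(k + 1), math.sqrt(k)
    print(k, round(lower, 4), round(lam, 4), round(upper, 4))
```

For k = 1 the formula reduces to $\sqrt{2/\pi}$, the mean absolute value of a standard Gaussian, and the two bounds close in on the exact value quickly as k grows.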

The main idea underlying Gordon’s theorem is a bound on the minimum gain of
an operator restricted to a set. Specifically, recall that $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\}$ is the
condition required for recovery by Proposition 5.2.1. Thus if we have that the minimum
gain of $\Phi$ restricted to vectors in the set $T_{\mathcal{A}}(x^*) \cap \mathbb{S}^{p-1}$ is bounded away from zero, then
it is clear that $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\}$. We refer to such minimum gains restricted to a
subset of the sphere as restricted minimum singular values, and the following theorem
of Gordon gives a bound on these quantities [76]:

Theorem 5.3.1 (Gordon’s Minimum Restricted Singular Values Theorem). Let $\Omega$ be
a closed subset of $\mathbb{S}^{p-1}$. Let $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ be a random map with i.i.d. zero-mean
Gaussian entries having variance one. Then provided that $\lambda_n \ge w(\Omega) + \epsilon$, we have

$$\mathbb{P}\left[\min_{z \in \Omega} \|\Phi z\|_2 \ge \epsilon\right] \ge 1 - \frac{5}{2}\exp\left(-\frac{1}{18}\big(\lambda_n - w(\Omega) - \epsilon\big)^2\right). \tag{5.12}$$

This theorem is not explicitly stated as such in [76] but the proof follows directly as a
result of Gordon’s arguments. Theorem 5.3.1 allows us to characterize exact recovery in
the noise-free case using the convex program (5.5), and robust recovery in the noisy case
using the convex program (5.7). Specifically, we consider the number of measurements
required for exact or robust recovery when the measurement map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ consists
of i.i.d. zero-mean Gaussian entries having variance 1/n. The normalization of the
variance ensures that the columns of $\Phi$ are approximately unit-norm, and is necessary
in order to properly define a signal-to-noise ratio. The following corollary summarizes
the main results of interest in our setting:

Corollary 5.3.1. Let $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ be a random map with i.i.d. zero-mean Gaussian
entries having variance 1/n. Further let $\Omega = T_{\mathcal{A}}(x^*) \cap \mathbb{S}^{p-1}$ denote the spherical part
of the tangent cone $T_{\mathcal{A}}(x^*)$.

1. Suppose that we have measurements $y = \Phi x^*$, and we solve the convex program
(5.5). Then $x^*$ is the unique optimum of (5.5) with high probability provided that

$$n \ge w(\Omega)^2 + O(1).$$

2. Suppose that we have noisy measurements $y = \Phi x^* + \omega$, with the noise $\omega$ bounded
as $\|\omega\| \le \delta$, and that we solve the convex program (5.7). Letting $\hat{x}$ denote the
optimal solution of (5.7), we have that $\|x^* - \hat{x}\| \le \frac{2\delta}{\epsilon}$ with high probability provided

$$n \ge \frac{w(\Omega)^2}{(1 - \epsilon)^2} + O(1).$$

Proof. The two results are simple consequences of Theorem 5.3.1:

1. The first part follows by setting $\epsilon = 0$ in Theorem 5.3.1.

2. For $\epsilon \in (0, 1)$ we have from Theorem 5.3.1 that

$$\|\Phi(z)\| = \|z\| \left\|\Phi\left(\frac{z}{\|z\|}\right)\right\| \ge \epsilon \|z\| \tag{5.13}$$

for all $z \in T_{\mathcal{A}}(x^*)$ with high probability. Therefore we can apply Proposition 5.2.2
to conclude that $\|\hat{x} - x^*\| \le \frac{2\delta}{\epsilon}$ with high probability, provided that $n \ge \frac{w(\Omega)^2}{(1 - \epsilon)^2} + O(1)$.

□

Gordon’s theorem thus provides a simple characterization of the number of measurements
required for reconstruction with the atomic norm. Indeed the Gaussian width
of $\Omega = T_{\mathcal{A}}(x^*) \cap \mathbb{S}^{p-1}$ is the only quantity that we need to compute in order to obtain
bounds for both exact and robust recovery. Unfortunately it is in general not easy to
compute Gaussian widths. Rudelson and Vershynin [127] have worked out Gaussian
widths for the special case of tangent cones at sparse vectors on the boundary of the $\ell_1$
ball, and derived results for sparse vector recovery using $\ell_1$ minimization that improve
upon previous results. In the next section we give various well-known properties of the
Gaussian width that are useful in some computations. In Section 5.3.3 we discuss a new
approach to width computations that gives near-optimal recovery bounds in a variety
of settings.
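The way the measurement count in the robust case picks up a factor depending on $\epsilon$ can be traced to the 1/n variance normalization. The following is a sketch of that scaling step, not a reproduction of the proof. It writes $\Phi = G/\sqrt{n}$ with G having unit-variance entries, uses $\lambda_n$ for the expected length of an n-dimensional standard Gaussian vector, and assumes the bound $\lambda_n \ge n/\sqrt{n+1}$:

```latex
\begin{align*}
\min_{z \in \Omega} \|\Phi z\| \ge \epsilon
  \quad &\Longleftrightarrow \quad \min_{z \in \Omega} \|G z\| \ge \epsilon \sqrt{n}, \\
\text{Gordon's theorem applies to } G \text{ when} \quad
  \lambda_n &\ge w(\Omega) + \epsilon \sqrt{n}, \\
\text{and since } \lambda_n \ge \frac{n}{\sqrt{n+1}}, \text{ it suffices that} \quad
  \frac{n}{\sqrt{n+1}} - \epsilon \sqrt{n} &\ge w(\Omega).
\end{align*}
```

For large n the left-hand side of the last line behaves like $(1 - \epsilon)\sqrt{n}$, so the requirement becomes $n \ge w(\Omega)^2 / (1 - \epsilon)^2$ up to an additive constant. Taking $\epsilon = 0$ gives the exact-recovery count of order $w(\Omega)^2$.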

■  5.3.2  Properties  of  Gaussian  width 

In  this  section  we  record  several  elementary  properties  of  the  Gaussian  width  that  are 
useful  for  computation.  We  begin  by  making  some  basic  observations,  which  are  easily 
derived. 

First we note that the width is monotonic. If $S_1 \subseteq S_2 \subseteq \mathbb{R}^p$, then it is clear from
the definition of the Gaussian width that

$$w(S_1) \le w(S_2).$$

Second we note that if we have a set $S \subseteq \mathbb{R}^p$, then the Gaussian width of S is equal to
the Gaussian width of the convex hull of S:

$$w(S) = w(\mathrm{conv}(S)).$$

This result follows from the basic fact in convex analysis that the maximum of a convex
function over a convex set is achieved at an extreme point of the convex set. Third if
$V \subset \mathbb{R}^p$ is a subspace in $\mathbb{R}^p$, then we have that

$$w(V \cap \mathbb{S}^{p-1}) = \sqrt{\dim(V)},$$

which follows from standard results on random Gaussians. This result also agrees with
the intuition that a random Gaussian map $\Phi$ misses a k-dimensional subspace with
high probability as long as $\dim(\mathrm{null}(\Phi)) \ge k + 1$. Finally, if a cone $S \subset \mathbb{R}^p$ is such that
$S = S_1 \oplus S_2$, where $S_1 \subset \mathbb{R}^p$ is a k-dimensional cone, $S_2 \subset \mathbb{R}^p$ is a (p − k)-dimensional
cone that is orthogonal to $S_1$, and $\oplus$ denotes the direct sum operation, then the width
can be decomposed as follows:

$$w(S \cap \mathbb{S}^{p-1})^2 \le w(S_1 \cap \mathbb{S}^{p-1})^2 + w(S_2 \cap \mathbb{S}^{p-1})^2.$$

These observations are useful in a variety of situations. For example a width computation
that frequently arises is one in which $S = S_1 \oplus S_2$ as described above, with $S_1$
being a k-dimensional subspace. It follows that the width of $S \cap \mathbb{S}^{p-1}$ is bounded as

$$w(S \cap \mathbb{S}^{p-1})^2 \le k + w(S_2 \cap \mathbb{S}^{p-1})^2.$$

These basic operations involving Gaussian widths were used by Rudelson and Vershynin
[127] to compute the Gaussian widths of tangent cones at sparse vectors with respect
to the $\ell_1$ norm ball.
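The subspace property can be checked by simulation. The sketch below is a Monte Carlo illustration with hypothetical function and parameter names. It estimates the Gaussian width of the unit sphere of a coordinate subspace, for which the supremum in the definition has a closed form. The exact expectation is the expected length of a k-dimensional Gaussian vector, which lies just below $\sqrt{k}$.

```python
import math
import random

def width_of_coordinate_subspace(k, trials=4000, seed=0):
    """Monte Carlo estimate of the Gaussian width of V ∩ sphere, V = span(e_1..e_k).

    For a unit vector z in V, g^T z is maximized by z = P_V g / ||P_V g||,
    so the supremum equals the length of the first k coordinates of g.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)))
    return total / trials

k = 25
estimate = width_of_coordinate_subspace(k)
print(estimate, math.sqrt(k))  # compare the estimate with sqrt(k) = 5
```

The ambient dimension p does not enter, since only the coordinates of g inside the subspace contribute to the supremum.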

Another  tool  for  computing  Gaussian  widths  is  based  on  Dudley’s  inequality  [57,96], 
which  bounds  the  width  of  a  set  in  terms  of  the  covering  number  of  the  set  at  all  scales. 

Definition 5.3.2. Let S be an arbitrary compact subset of $\mathbb{R}^p$. The covering number
of S in the Euclidean norm at resolution $\epsilon$ is the smallest number, $\mathfrak{N}(S, \epsilon)$, such that
$\mathfrak{N}(S, \epsilon)$ Euclidean balls of radius $\epsilon$ cover S.

Theorem 5.3.2 (Dudley’s Inequality). Let S be an arbitrary compact subset of $\mathbb{R}^p$, and
let g be a random vector with i.i.d. zero-mean, unit-variance Gaussian entries. Then

$$w(S) \le 24 \int_0^\infty \sqrt{\log(\mathfrak{N}(S, \epsilon))}\, d\epsilon. \tag{5.14}$$

We note here that a weak converse to Dudley’s inequality can be obtained via Sudakov’s
Minoration [96] by using the covering number for just a single scale. Specifically,
we have the following lower bound on the Gaussian width of a compact subset $S \subset \mathbb{R}^p$
for any $\epsilon > 0$:

$$w(S) \ge c\,\epsilon \sqrt{\log(\mathfrak{N}(S, \epsilon))}.$$

Here  c  >  0  is  some  universal  constant. 

Although  Dudley’s  inequality  can  be  applied  quite  generally,  estimating  covering 
numbers is difficult in most instances. There are a few simple characterizations available
for spheres and Sobolev spaces, and some tractable arguments based on Maurey’s
empirical method [96]. However it is not evident how to compute these numbers for
general convex cones. Also, in order to apply Dudley’s inequality we need to estimate
the covering number at all scales. Further Dudley’s inequality can be quite loose in
its  estimates,  and  it  often  introduces  extraneous  polylogarithmic  factors.  In  the  next 
section  we  describe  a  new  mechanism  for  estimating  Gaussian  widths,  which  provides 
near-optimal  guarantees  for  recovery  of  sparse  vectors  and  low-rank  matrices,  as  well 
as  for  several  of  the  recovery  problems  discussed  in  Section  5.3.4. 

■  5.3.3  New  results  on  Gaussian  width 

We  discuss  a  new  dual  framework  for  computing  Gaussian  widths.  In  particular  we 
express  the  Gaussian  width  of  a  cone  in  terms  of  the  dual  of  the  cone.  To  be  fully 
general let C be a non-empty convex cone in $\mathbb{R}^p$, and let $C^*$ denote the polar of C. We
can then upper bound the Gaussian width of any cone C in terms of the polar cone $C^*$:

Proposition 5.3.1. Let C be any non-empty convex cone in $\mathbb{R}^p$, and let $g \sim \mathcal{N}(0, I)$
be a random Gaussian vector. Then we have the following bound:

$$w(C \cap \mathbb{S}^{p-1}) \le \mathbb{E}_g\left[\mathrm{dist}(g, C^*)\right],$$

where dist here denotes the Euclidean distance between a point and a set.

The proof is given in Appendix C.1, and it follows from an appeal to convex duality.
Proposition 5.3.1 is more or less a restatement of the fact that the support function
of a convex cone is equal to the distance to its polar cone. As it is the square of the
Gaussian width that is of interest to us (see Corollary 5.3.1), it is often useful to apply
Jensen’s inequality to make the following approximation:

$$\mathbb{E}_g\left[\mathrm{dist}(g, C^*)\right]^2 \le \mathbb{E}_g\left[\mathrm{dist}(g, C^*)^2\right]. \tag{5.15}$$

The  inspiration  for  our  characterization  in  Proposition  5.3.1  of  the  width  of  a  cone 
in  terms  of  the  expected  distance  to  its  dual  came  from  the  work  of  Stojnic  [134], 
who  used  linear  programming  duality  to  construct  Gaussian-width-based  estimates  for 
analyzing  recovery  in  sparse  reconstruction  problems.  Specifically,  Stojnic’s  relatively 
simple  approach  recovered  well-known  phase  transitions  in  sparse  signal  recovery  [55], 
and  also  generalized  to  block  sparse  signals  and  other  forms  of  structured  sparsity. 

This  new  dual  characterization  yields  a  number  of  useful  bounds  on  the  Gaussian 
width,  which  we  describe  here.  In  the  following  section  we  use  these  bounds  to  derive 
new  recovery  results.  The  first  result  is  a  bound  on  the  Gaussian  width  of  a  cone  in 
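terms of the Gaussian width of its polar.

Lemma 5.3.1. Let $C \subseteq \mathbb{R}^p$ be a non-empty closed, convex cone. Then we have that

$$w(C \cap \mathbb{S}^{p-1})^2 + w(C^* \cap \mathbb{S}^{p-1})^2 \le p.$$

Proof. Combining Proposition 5.3.1 and (5.15), we have that

$$w(C \cap \mathbb{S}^{p-1})^2 \le \mathbb{E}_g\left[\mathrm{dist}(g, C^*)^2\right],$$

where as before $g \sim \mathcal{N}(0, I)$. For any $z \in \mathbb{R}^p$ we let $\Pi_C(z) = \arg\inf_{u \in C} \|z - u\|$ denote
the projection of z onto C. From standard results in convex analysis [124], we note that
one can decompose any $z \in \mathbb{R}^p$ into orthogonal components as follows:

$$z = \Pi_C(z) + \Pi_{C^*}(z), \qquad \langle \Pi_C(z), \Pi_{C^*}(z) \rangle = 0.$$

Therefore we have the following sequence of bounds:

$$\begin{aligned}
w(C \cap \mathbb{S}^{p-1})^2 &\le \mathbb{E}_g\left[\mathrm{dist}(g, C^*)^2\right] \\
&= \mathbb{E}_g\left[\|\Pi_C(g)\|^2\right] \\
&= \mathbb{E}_g\left[\|g\|^2 - \|\Pi_{C^*}(g)\|^2\right] \\
&= p - \mathbb{E}_g\left[\|\Pi_{C^*}(g)\|^2\right] \\
&= p - \mathbb{E}_g\left[\mathrm{dist}(g, C)^2\right] \\
&\le p - w(C^* \cap \mathbb{S}^{p-1})^2.
\end{aligned}$$

□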
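The orthogonal decomposition used in this proof can be verified directly for a cone whose projections are available in closed form. The sketch below is illustrative, with hypothetical names. It takes C to be the nonnegative orthant, whose polar is the nonpositive orthant, and checks the decomposition and the identity $\mathrm{dist}(g, C^*)^2 = \|\Pi_C(g)\|^2$ on one random draw.

```python
import random

def project_nonneg_orthant(z):
    """Euclidean projection onto C = {z : z_i >= 0}."""
    return [max(zi, 0.0) for zi in z]

def project_polar_of_orthant(z):
    """Euclidean projection onto C* = {z : z_i <= 0}, the polar of the orthant."""
    return [min(zi, 0.0) for zi in z]

rng = random.Random(1)
p = 8
g = [rng.gauss(0.0, 1.0) for _ in range(p)]
pc, pcs = project_nonneg_orthant(g), project_polar_of_orthant(g)

# z = Pi_C(z) + Pi_{C*}(z), with the two pieces orthogonal
recombined = [a + b for a, b in zip(pc, pcs)]
inner = sum(a * b for a, b in zip(pc, pcs))

# dist(g, C*)^2 equals ||Pi_C(g)||^2, the identity used in the chain of bounds
dist_to_polar_sq = sum((gi - b) ** 2 for gi, b in zip(g, pcs))
norm_pc_sq = sum(a * a for a in pc)
print(inner, dist_to_polar_sq, norm_pc_sq)
```

For the orthant the two projections are the positive and negative parts of the vector, which makes the orthogonality immediate.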

In  many  recovery  problems  one  is  interested  in  computing  the  width  of  a  self-dual 
cone.  For  such  cones  the  following  corollary  to  Lemma  5.3.1  gives  a  simple  solution: 

Corollary 5.3.2. Let $C \subset \mathbb{R}^p$ be a self-dual cone, i.e., $C = -C^*$. Then we have that

$$w(C \cap \mathbb{S}^{p-1})^2 \le \frac{p}{2}.$$

Proof. The proof follows directly from Lemma 5.3.1 as $w(C \cap \mathbb{S}^{p-1})^2 = w(C^* \cap \mathbb{S}^{p-1})^2$. □

Our next bound for the width of a cone C is based on the volume of its polar
$C^* \cap \mathbb{S}^{p-1}$. The volume of a measurable subset of the sphere is the fraction of the sphere
$\mathbb{S}^{p-1}$ covered by the subset. Thus it is a quantity between zero and one.

Theorem 5.3.3 (Gaussian width from volume of the polar). Let $C \subset \mathbb{R}^p$ be any closed,
convex, solid cone, and suppose that its polar $C^*$ is such that $C^* \cap \mathbb{S}^{p-1}$ has a volume of
$\Theta \in [0, 1]$. Then for $p \ge 9$ we have that

$$w(C \cap \mathbb{S}^{p-1}) \le 3\sqrt{\log\left(\frac{4}{\Theta}\right)}.$$

The proof of this theorem is given in Appendix C.2. The main property that we
appeal to in the proof is Gaussian isoperimetry. In particular there is a formal sense
in which a spherical cap¹ is the “extremal case” among all subsets of the sphere with
a given volume $\Theta$. Other than this observation the proof mainly involves a sequence of
integral  calculations. 

Note that if we are given a specification of a cone $C \subset \mathbb{R}^p$ in terms of a membership
oracle, it is possible to efficiently obtain good numerical estimates of the volume of
$C \cap \mathbb{S}^{p-1}$ [58]. Moreover, simple symmetry arguments often give relatively accurate
estimates  of  these  volumes.  Such  estimates  can  then  be  plugged  into  Theorem  5.3.3  to 
yield  bounds  on  the  width. 


■  5.3.4  New  recovery  bounds 

We  use  the  bounds  derived  in  the  last  section  to  obtain  new  recovery  results.  First 
using  the  dual  characterization  of  the  Gaussian  width  in  Proposition  5.3.1,  we  are 
able  to  obtain  sharp  bounds  on  the  number  of  measurements  required  for  recovering 
sparse  vectors  and  low-rank  matrices  from  random  Gaussian  measurements  using  convex 
optimization (i.e., $\ell_1$-norm and nuclear norm minimization).

Proposition 5.3.2. Let $x^* \in \mathbb{R}^p$ be an s-sparse vector. Letting A denote the set of
unit-Euclidean-norm one-sparse vectors, we have that

$$w(T_{\mathcal{A}}(x^*))^2 \le \begin{cases} 2s\left(\log\left(\frac{p}{s} - 1\right) + 1\right) & s \le 0.26\,p \\ 2s\left(\log(p - s) + 1\right) & \text{otherwise.} \end{cases}$$

Thus, when $s \le 0.26\,p$, $2s(\log(p/s - 1) + 1)$ random Gaussian measurements suffice to
recover $x^*$ via $\ell_1$ norm minimization with high probability. Moreover, $2s(\log(p - s) + 1)$
measurements suffice for any value of s.
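The dual characterization suggests a simple numerical check of this bound. The sketch below is a Monte Carlo illustration under stated assumptions, not the argument of Appendix C.3. It estimates the expected squared distance from a Gaussian vector to the normal cone of the $\ell_1$ ball at an s-sparse point, which upper-bounds the squared width by Proposition 5.3.1 and Jensen's inequality, and compares it with $2s(\log(p/s - 1) + 1)$. The function names, the grid over the scaling t, and the choice of all-positive signs on the support are illustrative.

```python
import math
import random

def dist_sq_to_l1_normal_cone(g, support_signs, t_grid):
    """Upper bound on dist(g, N)^2, where N is the cone of the l1 subdifferential.

    Points of N have the form t*v with t >= 0, v_i = sign(x*_i) on the support
    and |v_i| <= 1 elsewhere. For fixed t the best off-support choice leaves a
    residual (|g_i| - t)_+, so we minimize over a grid of t values.
    """
    best = float("inf")
    for t in t_grid:
        on = sum((g[i] - t * sgn) ** 2 for i, sgn in support_signs.items())
        off = sum(max(abs(gi) - t, 0.0) ** 2
                  for i, gi in enumerate(g) if i not in support_signs)
        best = min(best, on + off)
    return best

def estimate_width_sq(p, s, trials=100, seed=0):
    """Average the squared distance over random Gaussian draws."""
    rng = random.Random(seed)
    signs = {i: 1.0 for i in range(s)}        # x* supported on the first s entries
    t_grid = [0.1 * j for j in range(51)]     # t ranges over [0, 5]
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += dist_sq_to_l1_normal_cone(g, signs, t_grid)
    return total / trials

p, s = 200, 10
bound = 2 * s * (math.log(p / s - 1) + 1)
estimate = estimate_width_sq(p, s)
print(estimate, bound)
```

Restricting t to a grid can only increase the computed distance, so the estimate remains an upper bound on the quantity of interest up to sampling noise.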

¹ A spherical cap is a subset of the sphere obtained by intersecting the sphere $\mathbb{S}^{p-1}$ with a halfspace.

Proposition 5.3.3. Let $x^*$ be an $m_1 \times m_2$ rank-r matrix with $m_1 \le m_2$. Letting A
denote the set of unit-Euclidean-norm rank-one matrices, we have that

$$w(T_{\mathcal{A}}(x^*))^2 \le 3r(m_1 + m_2 - r).$$

Thus $3r(m_1 + m_2 - r)$ random Gaussian measurements suffice to recover $x^*$ via nuclear
norm minimization with high probability.

The proofs of these propositions are given in Appendix C.3. The number of measurements
required by these bounds is on the same order as previously known results [28,53],
but with improved constants. We also note that we have robust recovery
at  these  thresholds.  Further  these  results  do  not  require  explicit  recourse  to  any  type 
of  restricted  isometry  property  [28],  and  the  proofs  are  simple  and  based  on  elementary 
integrals. 
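To give a sense of scale, the measurement counts quoted in Proposition 5.3.2 and Proposition 5.3.3 can be evaluated for concrete sizes and compared with the ambient dimension. The helper names and the example sizes below are illustrative.

```python
import math

def sparse_measurements(p, s):
    """Count from Proposition 5.3.2: first form if s <= 0.26 p, else the general one."""
    if s <= 0.26 * p:
        return 2 * s * (math.log(p / s - 1) + 1)
    return 2 * s * (math.log(p - s) + 1)

def low_rank_measurements(m1, m2, r):
    """Count from Proposition 5.3.3."""
    return 3 * r * (m1 + m2 - r)

print(sparse_measurements(1000, 10))       # about 111.9, versus ambient dimension 1000
print(low_rank_measurements(100, 100, 2))  # 1188, versus ambient dimension 10000
```

In both cases the count scales with the number of degrees of freedom of the model rather than with the ambient dimension.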

Next  we  obtain  a  set  of  recovery  results  by  appealing  to  Corollary  5.3.2  on  the  width 
of  a  self-dual  cone.  These  examples  correspond  to  the  recovery  of  individual  atoms  (i.e., 
the  extreme  points  of  the  set  conv(A)),  although  the  same  machinery  is  applicable  in 
principle  to  estimate  the  number  of  measurements  required  to  recover  models  formed 
as  sums  of  a  few  atoms  (i.e.,  points  lying  on  low-dimensional  faces  of  conv(A)).  We 
first  obtain  a  well-known  result  on  the  number  of  measurements  required  for  recovering 
sign-vectors via $\ell_\infty$ norm minimization.

Proposition 5.3.4. Let $x^* \in \{-1, +1\}^p$ be a sign-vector in $\mathbb{R}^p$, and let A be the set of
all such sign-vectors. Then we have that

$$w(T_{\mathcal{A}}(x^*))^2 \le \frac{p}{2}.$$

Thus $\frac{p}{2}$ random Gaussian measurements suffice to recover $x^*$ via $\ell_\infty$-norm minimization
with high probability.

Proof. The tangent cone at any signed vector $x^*$ with respect to the $\ell_\infty$ ball is a
rotation of the nonnegative orthant. Thus we only need to compute the Gaussian
width of an orthant in $\mathbb{R}^p$. As the orthant is self-dual, we have the required bound from
Corollary 5.3.2. □
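The p/2 bound for the orthant can be checked by simulation, since the supremum in the Gaussian width has a closed form for this cone. The following is a Monte Carlo sketch with illustrative names and parameters.

```python
import math
import random

def orthant_width_sq(p, trials=4000, seed=0):
    """Monte Carlo estimate of the squared Gaussian width of orthant ∩ sphere in R^p.

    For unit z >= 0, g^T z is maximized by z = g_+ / ||g_+|| when g has a
    positive entry, giving sup = ||g_+||; otherwise the best z is the
    coordinate vector of the largest (least negative) entry.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        pos = math.sqrt(sum(max(gi, 0.0) ** 2 for gi in g))
        total += pos if pos > 0.0 else max(g)
    return (total / trials) ** 2

p = 40
estimate = orthant_width_sq(p)
print(estimate, p / 2)  # compare the estimate with the bound p/2 = 20
```

Each coordinate contributes an expected 1/2 to the squared length of the positive part, which is where the p/2 comes from; the squared width is slightly smaller by Jensen's inequality.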

This result agrees with previously computed bounds in [56,102], which relied on
a more complicated combinatorial argument. Next we compute the number of measurements
required to recover orthogonal matrices via spectral-norm minimization (see
Section 5.2.2). Let O(m) denote the group of m × m orthogonal matrices, viewed as a
subgroup of the set of nonsingular matrices in $\mathbb{R}^{m \times m}$.


Proposition 5.3.5. Let $x^\star \in \mathbb{R}^{m \times m}$ be an orthogonal matrix, and let A be the set of
all orthogonal matrices. Then we have that

$$w(T_{\mathcal{A}}(x^\star))^2 \le \frac{3m^2 - m}{4}.$$

Thus $\frac{3m^2 - m}{4}$ random Gaussian measurements suffice to recover $x^\star$ via spectral-norm
minimization with high probability.


Proof.  Due  to  the  symmetry  of  the  orthogonal  group,  it  suffices  to  consider  the  tangent 
cone  at  the  identity  matrix  I  with  respect  to  the  spectral  norm  ball.  Recall  that  the 
spectral  norm  ball  is  the  convex  hull  of  the  orthogonal  matrices.  Therefore  the  tangent 
space  at  the  identity  matrix  with  respect  to  the  orthogonal  group  O(m)  is  a  subset  of 
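the tangent cone $T_{\mathcal{A}}(I)$. It is well-known that this tangent space is the set of all m × m
skew-symmetric matrices. Thus we only need to compute the component S of $T_{\mathcal{A}}(I)$
that lies in the subspace of symmetric matrices:

$$\begin{aligned}
S &= \mathrm{cone}\{M - I : \|M\|_{\mathcal{A}} \le 1,\ M \text{ symmetric}\} \\
  &= \mathrm{cone}\{UDU^T - UU^T : \|D\|_{\mathcal{A}} \le 1,\ D \text{ diagonal},\ U \in O(m)\} \\
  &= \mathrm{cone}\{U(D - I)U^T : \|D\|_{\mathcal{A}} \le 1,\ D \text{ diagonal},\ U \in O(m)\} \\
  &= -\mathrm{PSD}_m.
\end{aligned}$$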


Here $\mathrm{PSD}_m$ denotes the set of m × m symmetric positive-semidefinite matrices. As this
cone is self-dual, we can apply Corollary 5.3.2 in conjunction with the observations in
Section 5.3.2 to conclude that

$$w(T_{\mathcal{A}}(I))^2 \le \binom{m}{2} + \frac{1}{2}\binom{m+1}{2} = \frac{3m^2 - m}{4}.$$

□


We note that the number of degrees of freedom in an m × m orthogonal matrix (i.e.,
the dimension of the manifold of orthogonal matrices) is $\frac{m(m-1)}{2}$. Proposition 5.3.4
and Proposition 5.3.5 point to the importance of obtaining recovery bounds with sharp
constants. Larger constants in either result would imply that the number of measurements
required exceeds the ambient dimension of the underlying $x^\star$. In these and many
other cases of interest Gaussian width arguments not only give order-optimal recovery
results, but also provide precise constants that result in sharp recovery thresholds.
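To make the comparison between degrees of freedom, the width bound, and the ambient dimension concrete, the following sketch tabulates the three quantities for orthogonal matrices. It assumes the decomposition used in the proof of Proposition 5.3.5: the skew-symmetric subspace contributes its full dimension $\binom{m}{2}$, and the self-dual cone $-\mathrm{PSD}_m$ contributes half the dimension $\binom{m+1}{2}$ of the symmetric matrices. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def orthogonal_counts(m):
    """Return (degrees of freedom, width bound, ambient dimension) for O(m)."""
    dof = comb(m, 2)  # dimension of the manifold of m x m orthogonal matrices
    # Skew-symmetric part counted fully, symmetric (self-dual cone) part by half.
    width_bound = Fraction(comb(m, 2)) + Fraction(comb(m + 1, 2), 2)
    ambient = m * m  # dimension of the space of m x m matrices
    return dof, width_bound, ambient


if __name__ == "__main__":
    for m in (5, 10, 20):
        dof, bound, ambient = orthogonal_counts(m)
        # The bound agrees with the closed form (3 m^2 - m) / 4.
        assert bound == Fraction(3 * m * m - m, 4)
        print(m, dof, float(bound), ambient)
```

For m = 20 the three values are 190, 295 and 400, so the bound sits between the intrinsic dimension and the ambient dimension.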

Finally  we  give  a  third  set  of  recovery  results  that  appeal  to  the  Gaussian  width 
bound  of  Theorem  5.3.3.  The  following  measurement  bound  applies  to  cases  when 
conv(A) is a symmetric polytope (roughly speaking, all the vertices are “equivalent”),
and  is  a  simple  corollary  of  Theorem  5.3.3. 

Corollary 5.3.3. Suppose that the set A is a finite collection of m points, with the convex
hull conv(A) being a vertex-transitive polytope [149] whose vertices are the points in
A. Using the convex program (5.5) we have that 9 log(m) random Gaussian measurements
suffice, with high probability, for exact recovery of a point in A, i.e., a vertex of
conv(A).

Proof. We recall the basic fact from convex analysis that the normal cones at the vertices
of a convex polytope in $\mathbb{R}^p$ provide a partitioning of $\mathbb{R}^p$. As conv(A) is a vertex-transitive
polytope, the normal cone at a vertex covers a $\frac{1}{m}$ fraction of $\mathbb{R}^p$. Applying Theorem 5.3.3,
we have the desired result. □

Clearly we require the number of vertices to be bounded as $m \le \exp\{p/9\}$, so that
the estimate of the number of measurements is not vacuous. This result has
useful consequences in settings in which conv(A) is a combinatorial polytope, as such
polytopes are often vertex-transitive. We have the following example on the number of
measurements required to recover permutation matrices:

Proposition 5.3.6. Let $x^\star \in \mathbb{R}^{m \times m}$ be a permutation matrix, and let A be the set
of all m × m permutation matrices. Then 9 m log(m) random Gaussian measurements
suffice, with high probability, to recover $x^\star$ by solving the optimization problem (5.5),
which minimizes the norm induced by the Birkhoff polytope of doubly stochastic matrices.

Proof. This result follows from Corollary 5.3.3 by noting that there are m! permutation
matrices of size m × m, so that 9 log(m!) ≤ 9 m log(m). □
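The counting step behind Proposition 5.3.6 can be sketched numerically. The code below evaluates the bound of Corollary 5.3.3 with m! vertices, the simplified bound 9 m log(m), and the ambient dimension $m^2$, so that one can see for which sizes the estimate is informative. The function name and dictionary keys are illustrative assumptions; `math.lgamma(m + 1)` is used for log(m!).

```python
import math


def permutation_bounds(m):
    """Measurement estimates for recovering an m x m permutation matrix."""
    log_num_vertices = math.lgamma(m + 1)  # log(m!), the log of the vertex count
    return {
        "corollary": 9.0 * log_num_vertices,   # 9 log(number of vertices)
        "proposition": 9.0 * m * math.log(m),  # simplified form 9 m log(m)
        "ambient": float(m * m),               # dimension of m x m matrices
    }


if __name__ == "__main__":
    for m in (10, 20, 50, 100):
        b = permutation_bounds(m)
        informative = b["corollary"] <= b["ambient"]
        print(m, round(b["corollary"], 1), round(b["proposition"], 1),
              b["ambient"], informative)
```

Because log(m!) grows like m log(m) while the ambient dimension grows like $m^2$, the estimate becomes a vanishing fraction of the ambient dimension for large m, although for small m the constants matter.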

■  5.4  Representability  and  Algebraic  Geometry  of  Atomic  Norms 

■  5.4.1  Role  of  algebraic  structure 

All of our discussion thus far has focussed on arbitrary atomic sets A. As seen in
Section 5.2 the geometry of the convex hull conv(A) completely determines conditions
under which exact recovery is possible using the convex program (5.5). In this section
we address the question of computationally representing the convex hull conv(A) (or
equivalently of computing the atomic norm $\|\cdot\|_{\mathcal{A}}$). These issues are critical in order
to be able to solve the convex optimization problem (5.5). Although the convex hull
conv(A) is a well-defined object, in general we may not even be able to computationally
represent it (for example, if A is a fractal). In order to obtain exact or approximate
representations (analogous to the cases of the $\ell_1$ norm and the nuclear norm) it is
important to impose some structure on the atomic set A. We focus on cases in which
the set A has algebraic structure. Specifically let the ring of multivariate polynomials
in p variables be denoted by $\mathbb{R}[x] = \mathbb{R}[x_1, \ldots, x_p]$. We then consider real algebraic
varieties [18]:

Definition 5.4.1. A real algebraic variety $S \subseteq \mathbb{R}^p$ is the set of real solutions of a
system of polynomial equations:

$$S = \{x : g_j(x) = 0,\ \forall j\},$$

where $\{g_j\}$ is a finite collection of polynomials in $\mathbb{R}[x]$.

Indeed all of the atomic sets A considered in this chapter are examples of algebraic
varieties. Algebraic varieties have the remarkable property that (the closure of)
their convex hull can be arbitrarily well-approximated in a constructive manner as (the
projection of) a set defined by linear matrix inequality constraints. A potential complication
may arise, however, if these semidefinite representations are intractable to
compute in polynomial time. In such cases it is possible to approximate the convex
hulls via a hierarchy of tractable semidefinite relaxations. We describe these results in
more detail in Section 5.4.2. Therefore the atomic norm minimization problems such as
(5.7) arising in such situations can be solved exactly or approximately via semidefinite
programming.
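As a small illustration of what it means for an atomic set to be a real algebraic variety, the sketch below writes two of the atomic sets from this chapter as zero sets of polynomials and tests membership by evaluating the residuals. The particular choice of defining polynomials ($x_i^2 - 1$ for sign-vectors, the entries of $X^T X - I$ for orthogonal matrices) and all function names are assumptions made for this example.

```python
import math


def sign_vector_residuals(x):
    """Residuals of g_i(x) = x_i^2 - 1, which vanish exactly on sign-vectors."""
    return [xi * xi - 1.0 for xi in x]


def orthogonal_residuals(X):
    """Residuals of the entries of X^T X - I for a square matrix X (list of rows)."""
    m = len(X)
    return [
        sum(X[k][i] * X[k][j] for k in range(m)) - (1.0 if i == j else 0.0)
        for i in range(m)
        for j in range(m)
    ]


def in_variety(residuals, tol=1e-9):
    """A point lies on the variety when every defining polynomial vanishes."""
    return all(abs(r) <= tol for r in residuals)


if __name__ == "__main__":
    theta = 0.3
    rotation = [[math.cos(theta), -math.sin(theta)],
                [math.sin(theta), math.cos(theta)]]
    print(in_variety(sign_vector_residuals([1.0, -1.0, 1.0])))
    print(in_variety(orthogonal_residuals(rotation)))
    print(in_variety(orthogonal_residuals([[1.0, 1.0], [0.0, 1.0]])))
```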

Algebraic structure also plays a second important role in atomic norm minimization
problems. If an atomic norm $\|\cdot\|_{\mathcal{A}}$ is intractable to compute, we may approximate it via
a more tractable norm $\|\cdot\|_{\mathrm{app}}$. However not every approximation of the atomic norm is
equally good for solving inverse problems. As illustrated in Figure 5.1 we can construct
approximations of the $\ell_1$ ball that are tight in a metric sense, with
$(1 - \epsilon)\|\cdot\|_{\mathrm{app}} \le \|\cdot\|_{\ell_1} \le (1 + \epsilon)\|\cdot\|_{\mathrm{app}}$, but where the tangent cones at sparse vectors in the new
norm are halfspaces. In such a case, the number of measurements required to recover
the sparse vector ends up being on the same order as the ambient dimension. (Note
that the $\ell_1$-norm is in fact tractable to compute; we simply use it here for illustrative
purposes.) The key property that we seek in approximations to an atomic norm $\|\cdot\|_{\mathcal{A}}$
is that they preserve algebraic structure such as the vertices/extreme points and more
generally the low-dimensional faces of conv(A). As discussed in Section 5.2.5 points
on such low-dimensional faces correspond to simple models, and algebraic-structure
preserving approximations ensure that the tangent cones at simple models with respect
to the approximations are not too much larger than the corresponding tangent cones
with respect to the original atomic norms.

Figure 5.1. The convex body given by the dotted line is a good metric approximation to the $\ell_1$ ball.
However as its “corners” are “smoothed out”, the tangent cone at $x^\star$ goes from being a proper cone
(with respect to the $\ell_1$ ball) to a halfspace (with respect to the approximation).

■  5.4.2  Semidefinite  relaxations  using  Theta  bodies  -  an  example 

In this section we give an example of a family of semidefinite relaxations to the atomic
norm minimization problem; the hierarchy of relaxations is obtained using the Theta-bodies
construction of [77] (see Chapter 2 for a brief summary), and is applicable
whenever the atomic set has algebraic structure. To begin with if we approximate the
atomic norm $\|\cdot\|_{\mathcal{A}}$ by another atomic norm $\|\cdot\|_{\tilde{\mathcal{A}}}$ defined using a larger collection of
atoms $\mathcal{A} \subseteq \tilde{\mathcal{A}}$, it is clear that

$$\|\cdot\|_{\tilde{\mathcal{A}}} \le \|\cdot\|_{\mathcal{A}}.$$

Consequently outer approximations of the atomic set give rise to approximate norms
that provide lower bounds on the optimal value of the problem (5.5).
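The effect of enlarging the atomic set can be written out in one line. The display below is a sketch that assumes the atomic norm is the gauge of the convex hull of the atoms, i.e. $\|x\|_{\mathcal{A}} = \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}$, and writes $\tilde{\mathcal{A}}$ for a superset of the original atomic set $\mathcal{A}$.

```latex
\begin{aligned}
\mathcal{A} \subseteq \tilde{\mathcal{A}}
  &\;\Longrightarrow\;
  t\,\mathrm{conv}(\mathcal{A}) \subseteq t\,\mathrm{conv}(\tilde{\mathcal{A}})
  \quad \text{for all } t > 0, \\
\|x\|_{\tilde{\mathcal{A}}}
  &= \inf\{t > 0 : x \in t\,\mathrm{conv}(\tilde{\mathcal{A}})\}
  \;\le\; \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}
  = \|x\|_{\mathcal{A}}.
\end{aligned}
```

The infimum on the left is taken over a larger set of admissible values of t, so it can only be smaller. If (5.5) minimizes the atomic norm over a fixed feasible set, then replacing the objective by a pointwise smaller function can only lower the optimal value.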

In order to provide such lower bounds on the optimal value of (5.5), we discuss
semidefinite relaxations of the convex hull conv(A) based on Theta bodies. Specifically
we discuss an example application of these relaxations to the problem of approximating
the tensor nuclear norm. We focus on the case of tensors of order three that lie in
$\mathbb{R}^{m \times m \times m}$, i.e., tensors indexed by three numbers, for notational simplicity, although
our discussion is applicable more generally. In particular the atomic set A is the set of
unit-Euclidean-norm rank-one tensors:

$$\begin{aligned}
\mathcal{A} &= \{u \otimes v \otimes w : u, v, w \in \mathbb{R}^m,\ \|u\| = \|v\| = \|w\| = 1\} \\
  &= \{N \in \mathbb{R}^{m^3} : N = u \otimes v \otimes w,\ u, v, w \in \mathbb{R}^m,\ \|u\| = \|v\| = \|w\| = 1\},
\end{aligned}$$

where $u \otimes v \otimes w$ is the tensor product of three vectors. Note that the second description
is written as the projection onto $\mathbb{R}^{m^3}$ of a variety defined in $\mathbb{R}^{m^3 + 3m}$. The nuclear norm
is then given by (5.2), and is intractable to compute in general. Now let $I_{\mathcal{A}}$ denote a
polynomial ideal of polynomial maps from $\mathbb{R}^{m^3 + 3m}$ to $\mathbb{R}$:

$$I_{\mathcal{A}} = \Big\{ g : g = \sum_{i,j,k=1}^{m} g_{ijk}\,(N_{ijk} - u_i v_j w_k) + g_u\,(u^T u - 1) + g_v\,(v^T v - 1) + g_w\,(w^T w - 1),\ \forall g_{ijk}, g_u, g_v, g_w \Big\}.$$

Here $g_u, g_v, g_w, \{g_{ijk}\}_{i,j,k}$ are polynomials in the variables N, u, v, w. Following the
program described above for constructing approximations, a family of semidefinite
relaxations to the tensor nuclear norm ball can be prescribed in this manner via the theta
bodies of the ideal $I_{\mathcal{A}}$.
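The lifted description of the rank-one tensor atoms can be checked on a concrete point. The sketch below builds an atom $u \otimes v \otimes w$ from unit vectors and evaluates the generators of the ideal (the $m^3$ polynomials $N_{ijk} - u_i v_j w_k$ and the three normalization polynomials); all of them should vanish on the lifted point (N, u, v, w). It uses NumPy's `einsum` for the outer product; the function names are hypothetical.

```python
import numpy as np


def rank_one_atom(u, v, w):
    """Order-three tensor with entries u_i * v_j * w_k."""
    return np.einsum("i,j,k->ijk", u, v, w)


def generator_residuals(N, u, v, w):
    """Largest absolute value of each family of generators at (N, u, v, w)."""
    return {
        "tensor": float(np.max(np.abs(N - rank_one_atom(u, v, w)))),
        "u": abs(float(u @ u) - 1.0),
        "v": abs(float(v @ v) - 1.0),
        "w": abs(float(w @ w) - 1.0),
    }


if __name__ == "__main__":
    m = 4
    rng = np.random.default_rng(0)
    # Draw three random directions and normalize them to unit Euclidean norm.
    u, v, w = (x / np.linalg.norm(x) for x in rng.standard_normal((3, m)))
    N = rank_one_atom(u, v, w)
    print(generator_residuals(N, u, v, w))
    print("lifted dimension:", N.size + 3 * m)  # m^3 + 3m variables
```

The Frobenius norm of such an atom equals the product of the three vector norms, so atoms built this way have unit Euclidean norm.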

■  5.4.3  Tradeoff  between  relaxation  and  number  of  measurements 

As  discussed  in  Section  5.2.5  the  atomic  norm  is  the  best  convex  heuristic  for  solving 
ill-posed  linear  inverse  problems  of  the  type  considered  in  this  chapter.  However  we  may 
wish  to  approximate  the  atomic  norm  in  cases  when  it  is  intractable  to  compute  exactly, 
and  the  discussion  in  the  preceding  section  provides  one  approach  to  constructing  a 
family  of  relaxations.  As  one  might  expect  the  tradeoff  for  using  such  approximations, 
i.e.,  a  weaker  convex  heuristic  than  the  atomic  norm,  is  an  increase  in  the  number  of 
measurements  required  for  exact  or  robust  recovery.  The  reason  for  this  is  that  the 
approximate  norms  have  larger  tangent  cones  at  their  extreme  points,  which  makes  it 
harder  to  satisfy  the  empty  intersection  condition  of  Proposition  5.2.1.  We  highlight 
this  tradeoff  here  with  an  illustrative  example  involving  the  cut  polytope. 

The cut polytope is defined as the convex hull of all cut matrices:

$$\mathcal{P} = \mathrm{conv}\{zz^T : z \in \{-1,+1\}^m\}.$$

As described in Section 5.2.2 low-rank matrices that are composed of ±1’s as entries
are of interest in collaborative filtering [133], and the norm induced by the cut polytope
is a potential convex heuristic for recovering such matrices from limited measurements.
However it is well-known that the cut polytope is intractable to characterize [47], and
therefore we need to use tractable relaxations instead. We consider the following two
relaxations of the cut polytope. The first is the popular relaxation that is used in
semidefinite approximations of the MAXCUT problem:

$$\mathcal{P}_1 = \{M : M \text{ symmetric},\ M \succeq 0,\ M_{ii} = 1,\ \forall i = 1, \ldots, m\}.$$

This is the well-studied elliptope [47], and can also be interpreted as the second theta
body relaxation (see Chapter 2) of the cut polytope $\mathcal{P}$ [77]. We also investigate the
performance of a second, weaker relaxation:

$$\mathcal{P}_2 = \{M : M \text{ symmetric},\ M_{ii} = 1,\ \forall i,\ |M_{ij}| \le 1,\ \forall i \ne j\}.$$

This polytope is simply the convex hull of symmetric matrices with ±1’s in the off-diagonal
entries, and 1’s on the diagonal. We note that $\mathcal{P}_2$ is an extremely weak
relaxation of $\mathcal{P}$, but we use it here only for illustrative purposes. It is easily seen that

$$\mathcal{P} \subset \mathcal{P}_1 \subset \mathcal{P}_2,$$

with all the inclusions being strict. Figure 5.2 gives a toy sketch that highlights all the
main geometric aspects of these relaxations. In particular $\mathcal{P}_1$ has many more extreme
points than $\mathcal{P}$, although the set of vertices of $\mathcal{P}_1$, i.e., points that have full-dimensional
normal cones, are precisely the cut matrices (which are the vertices of $\mathcal{P}$) [47]. The
convex polytope $\mathcal{P}_2$ contains many more vertices compared to $\mathcal{P}$ as shown in Figure 5.2.
As expected the tangent cones at vertices of $\mathcal{P}$ become increasingly larger as we use
successively weaker relaxations. The following result summarizes the number of random
measurements required for recovering a cut matrix, i.e., a rank-one sign matrix, using
the norms induced by each of these convex bodies.
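The relationship between the cut matrices and the two relaxations can be illustrated with simple membership tests. The sketch below assumes the descriptions of the relaxations used in this section: the elliptope consists of symmetric positive-semidefinite matrices with unit diagonal, and the weaker relaxation consists of symmetric matrices with unit diagonal and off-diagonal entries in [-1, 1]. It checks that a cut matrix lies in both sets and exhibits a 3 × 3 matrix that lies in the weaker relaxation but not in the elliptope. The function names and tolerance are illustrative.

```python
import numpy as np


def has_unit_diagonal_and_symmetry(M, tol=1e-9):
    return np.allclose(M, M.T, atol=tol) and np.allclose(np.diag(M), 1.0, atol=tol)


def in_elliptope(M, tol=1e-9):
    """Symmetric, unit diagonal, positive semidefinite (up to tol)."""
    M = np.asarray(M, dtype=float)
    return (has_unit_diagonal_and_symmetry(M, tol)
            and float(np.linalg.eigvalsh(M).min()) >= -tol)


def in_box_relaxation(M, tol=1e-9):
    """Symmetric, unit diagonal, all entries bounded by one in absolute value."""
    M = np.asarray(M, dtype=float)
    return (has_unit_diagonal_and_symmetry(M, tol)
            and float(np.max(np.abs(M))) <= 1.0 + tol)


if __name__ == "__main__":
    z = np.array([1.0, -1.0, 1.0, -1.0])
    cut_matrix = np.outer(z, z)  # a rank-one sign matrix z z^T
    print(in_elliptope(cut_matrix), in_box_relaxation(cut_matrix))

    # All off-diagonal entries equal to -1: the all-ones vector gives a
    # negative quadratic form, so this matrix is not positive semidefinite.
    M = np.array([[1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]])
    print(in_elliptope(M), in_box_relaxation(M))
```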

Proposition 5.4.1. Suppose $x^\star \in \mathbb{R}^{m \times m}$ is a rank-one sign matrix, i.e., a cut matrix,
and we are given n random Gaussian measurements of $x^\star$. We wish to recover $x^\star$ by
solving a convex program based on the norms induced by each of $\mathcal{P}$, $\mathcal{P}_1$, $\mathcal{P}_2$. We have
exact recovery of $x^\star$ in each of these cases with high probability under the following
conditions on the number of measurements:

1. Using $\mathcal{P}$: $n = O(m)$.

2. Using $\mathcal{P}_1$: $n = O(m)$.

3. Using $\mathcal{P}_2$: $n = \frac{m^2 - m}{4}$.

Figure 5.2. A toy sketch illustrating the cut polytope $\mathcal{P}$, and the two approximations $\mathcal{P}_1$ and $\mathcal{P}_2$.
Note that $\mathcal{P}_1$ is a sketch of the standard semidefinite relaxation that has the same vertices as $\mathcal{P}$. On
the other hand $\mathcal{P}_2$ is a polyhedral approximation to $\mathcal{P}$ that has many more vertices as shown in this
sketch.

Proof. For the first part, we note that $\mathcal{P}$ is a symmetric polytope with $2^{m-1}$ vertices.
Therefore we can apply Corollary 5.3.3 to conclude that $n = O(m)$ measurements
suffice for exact recovery.

For the second part we note that the tangent cone at $x^\star$ with respect to the nuclear
norm ball of m × m matrices contains within it the tangent cone at $x^\star$ with respect
to $\mathcal{P}_1$. Hence we appeal to Proposition 5.3.3 to conclude that $n = O(m)$
measurements suffice for exact recovery.

Finally, we note that $\mathcal{P}_2$ is essentially the hypercube in $\binom{m}{2}$ dimensions. Appealing
to Proposition 5.3.4, we conclude that $n = \frac{m^2 - m}{4}$ measurements suffice for exact
recovery. □

It  is  not  too  hard  to  show  that  these  bounds  are  order-optimal,  and  that  they 
cannot  be  improved.  Thus  we  have  a  rigorous  demonstration  in  this  particular  instance 
of  the  fact  that  the  number  of  measurements  required  for  exact  recovery  increases  as  the 
relaxations  get  weaker  (and  as  the  tangent  cones  get  larger).  The  principle  underlying 
this  illustration  holds  more  generally,  namely  that  there  exists  a  tradeoff  between  the 
complexity  of  the  convex  heuristic  and  the  number  of  measurements  required  for  exact 
or robust recovery. It would be of interest to quantify this tradeoff in other settings,
for example, in problems in which we use increasingly tight relaxations of the atomic
norm via theta bodies.

We also note that the tractable relaxation based on $\mathcal{P}_1$ is only off by a constant
factor with respect to the optimal heuristic based on the cut polytope $\mathcal{P}$. This suggests
the  potential  for  tractable  heuristics  to  approximate  hard  atomic  norms  with  provable 
approximation  ratios,  akin  to  methods  developed  in  the  literature  on  approximation 
algorithms  for  hard  combinatorial  optimization  problems. 

■  5.4.4  Terracini’s  lemma  and  lower  bounds  on  recovery 

Algebraic structure in the atomic set A provides yet another benefit, namely
lower bounds on the number of measurements required for exact recovery.
The recovery condition of Proposition 5.2.1 states that the nullspace null(Φ) of the
measurement operator $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ must miss the tangent cone $T_{\mathcal{A}}(x^\star)$ at the point of
interest $x^\star$. Suppose that this tangent cone contains a q-dimensional subspace. It is then
clear from straightforward linear algebra arguments that the number of measurements n
must be at least q. Indeed this bound must hold for any linear measurement scheme. Thus
the dimension of the subspace contained inside the tangent cone provides a simple lower
bound on the number of linear measurements.
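The linear algebra behind this lower bound is a dimension count. In the sketch below, L denotes the q-dimensional subspace contained in the tangent cone (a symbol introduced here for convenience), and the only facts used are that the nullspace of a map from $\mathbb{R}^p$ to $\mathbb{R}^n$ has dimension at least $p - n$ and that two subspaces of $\mathbb{R}^p$ intersect in dimension at least the sum of their dimensions minus p.

```latex
\dim\bigl(\mathrm{null}(\Phi) \cap L\bigr)
  \;\ge\; \dim \mathrm{null}(\Phi) + \dim L - p
  \;\ge\; (p - n) + q - p
  \;=\; q - n .
```

Hence whenever $n < q$ the nullspace contains a nonzero vector of L, and therefore of the tangent cone, so the recovery condition fails regardless of how the n linear measurements are chosen.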

In  this  section  we  discuss  a  method  to  obtain  estimates  of  the  dimension  of  a  sub¬ 
space  component  of  the  tangent  cone.  We  focus  again  on  the  setting  in  which  A  is  an 
algebraic  variety.  Indeed  in  all  of  the  examples  of  Section  5.2.2,  the  atomic  set  A  is 
an  algebraic  variety.  In  such  cases  simple  models  x*  formed  according  to  (5.1)  can  be 
viewed  as  elements  of  secant  varieties. 

Definition 5.4.2. Let $\mathcal{A} \subseteq \mathbb{R}^p$ be an algebraic variety. Then the k’th secant variety
$\mathcal{A}^k$ is defined as the union of all affine spaces passing through any k + 1 points of A.

Algebraic  geometry  has  a  long  history  of  investigations  of  secant  varieties,  as  well 
as  tangent  spaces  to  these  secant  varieties  [79].  In  particular  a  question  of  interest  is 
to  characterize  the  dimensions  of  secant  varieties  and  tangent  spaces.  In  our  context, 
estimates of these dimensions are useful in giving lower bounds on the number of measurements
required for recovery. Specifically we have the following result, which states
that certain linear spaces must lie in the tangent cone at $x^\star$ with respect to conv(A):

Proposition 5.4.2. Let $\mathcal{A} \subset \mathbb{R}^p$ be a smooth variety, and let $T(u, \mathcal{A})$ denote the
tangent space at any $u \in \mathcal{A}$ with respect to A. Suppose $x = \sum_{i=1}^{k} c_i a_i$, $\forall a_i \in \mathcal{A}$, $c_i \ge 0$,
such that

$$\|x\|_{\mathcal{A}} = \sum_{i=1}^{k} c_i.$$

Then the tangent cone $T_{\mathcal{A}}(x)$ contains the following linear space:

$$T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A}) \subset T_{\mathcal{A}}(x),$$

where $\oplus$ denotes the direct sum of subspaces.

Proof. We note that if we perturb $a_1$ slightly to any neighboring $a_1'$ so that $a_1' \in \mathcal{A}$,
then the resulting $x' = c_1 a_1' + \sum_{i=2}^{k} c_i a_i$ is such that $\|x'\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}$. The proposition
follows directly from this observation. □

By Terracini’s lemma [79] from algebraic geometry the subspace $T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A})$
is in fact the estimate for the tangent space $T(x, \mathcal{A}^{k-1})$ at x with respect to
the (k − 1)’th secant variety $\mathcal{A}^{k-1}$:

Proposition 5.4.3 (Terracini’s Lemma). Let $\mathcal{A} \subset \mathbb{R}^p$ be a smooth affine variety, and
let $T(u, \mathcal{A})$ denote the tangent space at any $u \in \mathcal{A}$ with respect to A. Suppose $x \in \mathcal{A}^{k-1}$
is a generic point such that $x = \sum_{i=1}^{k} c_i a_i$, $\forall a_i \in \mathcal{A}$, $c_i \ge 0$. Then the tangent space
$T(x, \mathcal{A}^{k-1})$ at x with respect to the secant variety $\mathcal{A}^{k-1}$ is given by
$T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A})$. Moreover the dimension of $T(x, \mathcal{A}^{k-1})$ is at most (and is expected to be)
$\min\{p, (k + 1)\dim(\mathcal{A}) + k\}$.

Combining these results we have that estimates of the dimension of the tangent space
$T(x, \mathcal{A}^{k-1})$ lead directly to lower bounds on the number of measurements required for
recovery. The intuition here is clear as the number of measurements required must
be bounded below by the number of “degrees of freedom,” which is captured by the
dimension of the tangent space $T(x, \mathcal{A}^{k-1})$. However Terracini’s lemma provides us with
general estimates of the dimension of $T(x, \mathcal{A}^{k-1})$ for generic points x. Therefore we can
directly obtain lower bounds on the number of measurements, purely by considering the
dimension of the variety A and the number of elements from A used to construct x (i.e.,
the order of the secant variety in which x lies). As an example the dimension of the
base variety of normalized order-three tensors in $\mathbb{R}^{m \times m \times m}$ is $3(m - 1)$. Consequently if
we were to in principle solve the tensor nuclear norm minimization problem, we should
expect to require at least $O(km)$ measurements to recover a rank-k tensor.
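The dimension count for tensors can be tabulated directly. The sketch below simply evaluates the expression quoted in Proposition 5.4.3, $\min\{p, (k+1)\dim(\mathcal{A}) + k\}$, with the ambient dimension $p = m^3$ and base dimension $3(m-1)$ stated for normalized rank-one order-three tensors. The function names are hypothetical, and the output should be read as the expected (generic) dimension rather than a proven measurement threshold.

```python
def expected_tangent_dim(p, dim_base, k):
    """Expression from Terracini's lemma as quoted: min{p, (k+1) dim(A) + k}."""
    return min(p, (k + 1) * dim_base + k)


def tensor_dimension_estimate(m, k):
    """Evaluate the estimate for order-three tensors with side length m."""
    ambient = m ** 3          # dimension of the space of m x m x m tensors
    dim_base = 3 * (m - 1)    # dimension of the variety of normalized atoms
    return expected_tangent_dim(ambient, dim_base, k)


if __name__ == "__main__":
    for m in (5, 10, 20):
        # For fixed k the estimate grows linearly in m until it hits m^3.
        print(m, [tensor_dimension_estimate(m, k) for k in (1, 2, 4, 8)])
```

For fixed k the estimate grows linearly in m, which is the source of the order-km scaling mentioned in the text.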
■  5.5  Computational  Experiments

■  5.5.1  Algorithmic  considerations 

While a variety of atomic norms can be represented or approximated by linear matrix
inequalities, these representations do not necessarily translate into practical implementations.
Semidefinite programming can technically be solved in polynomial time, but
general interior point solvers typically only scale to problems with a few hundred variables.
For larger scale problems, it is often preferable to exploit structure in the atomic
set A to develop fast, first-order algorithms.

A starting point for first-order algorithm design lies in determining the structure of
the proximity operator (or Moreau envelope) associated with the atomic norm,

$$\Pi_{\mathcal{A}}(x; \mu) := \arg\min_{z}\ \tfrac{1}{2}\|z - x\|^2 + \mu\|z\|_{\mathcal{A}}. \tag{5.16}$$

Here $\mu$ is some positive parameter. Proximity operators have already been harnessed
for fast algorithms involving the $\ell_1$ norm [39, 40, 66, 78, 144] and the nuclear norm
[26, 100, 137] where these maps can be quickly computed in closed form. For the $\ell_1$
norm, the ith component of $\Pi_{\mathcal{A}}(x; \mu)$ is given by

$$\Pi_{\ell_1}(x; \mu)_i = \begin{cases} x_i + \mu & x_i < -\mu \\ 0 & -\mu \le x_i \le \mu \\ x_i - \mu & x_i > \mu \end{cases} \tag{5.17}$$

This is the so-called soft thresholding operator. For the nuclear norm, $\Pi_{\mathcal{A}}$ soft thresholds
the singular values. In either case, the only structure necessary for the cited algorithms
to converge is the convexity of the norm. Indeed, essentially any algorithm developed
for $\ell_1$ or nuclear norm minimization can in principle be adapted for atomic norm minimization.
One simply needs to apply the operator $\Pi_{\mathcal{A}}$ wherever a shrinkage operation
was previously applied.
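The two closed-form proximity operators mentioned here are short enough to write down. The following is a minimal NumPy sketch: `soft_threshold` implements the componentwise rule (5.17), and `singular_value_threshold` applies the same rule to the singular values of a matrix, which is the nuclear norm case. The function names are chosen for this example.

```python
import numpy as np


def soft_threshold(x, mu):
    """Componentwise soft thresholding: shrink each entry toward zero by mu."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)


def singular_value_threshold(X, mu):
    """Soft threshold the singular values of X and reassemble the matrix."""
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    # Scale the columns of U by the shrunken singular values.
    return (U * soft_threshold(s, mu)) @ Vt


if __name__ == "__main__":
    print(soft_threshold([3.0, -0.5, -2.0], 1.0))
    print(singular_value_threshold(np.diag([3.0, 0.5]), 1.0))
```

Entries (or singular values) whose magnitude is below the threshold are set to zero, which is what produces sparse or low-rank iterates in the algorithms discussed next.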

For a concrete example, suppose f is a smooth function, and consider the optimization
problem

$$\min_{x}\ f(x) + \lambda\|x\|_{\mathcal{A}}. \tag{5.18}$$

The classical projected gradient method for this problem alternates between taking
steps along the negative gradient of f and then applying the proximity operator associated with
the atomic norm. Explicitly, the algorithm consists of the iterative procedure

$$x_{k+1} = \Pi_{\mathcal{A}}(x_k - \alpha_k \nabla f(x_k);\ \alpha_k \lambda) \tag{5.19}$$

where $\{\alpha_k\}$ is a sequence of positive stepsizes. Under very mild assumptions, this
iteration can be shown to converge to a stationary point of (5.18) [68]. When f is
convex, the returned stationary point is a globally optimal solution. Recently, Nesterov
has described a particular variant of this algorithm that is guaranteed to converge at a
rate no worse than $O(k^{-1})$, where k is the iteration counter [112]. Moreover, he proposes
simple enhancements of the standard iteration to achieve an $O(k^{-2})$ convergence rate
for convex f and a linear rate of convergence for strongly convex f.

If we apply the projected gradient method to the regularized inverse problem

$$\min_{x}\ \|\Phi x - y\|^2 + \lambda\|x\|_{\mathcal{A}} \tag{5.20}$$

then the algorithm reduces to the straightforward iteration

$$x_{k+1} = \Pi_{\mathcal{A}}(x_k + \alpha_k \Phi^{\dagger}(y - \Phi x_k);\ \alpha_k \lambda). \tag{5.21}$$

Here $\Phi^{\dagger}$ denotes the adjoint of $\Phi$. Problem (5.20) is equivalent to (5.7) for an
appropriately chosen $\lambda > 0$ and is useful for estimation from noisy measurements.
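The iteration (5.21) can be sketched for the simplest atomic norm, the $\ell_1$ norm, where the proximity operator is soft thresholding. The code below is an illustration under stated assumptions, not the implementation used for the experiments in this chapter: it uses the objective $\tfrac{1}{2}\|\Phi x - y\|^2 + \lambda\|x\|_1$ (the factor $\tfrac{1}{2}$ is an assumption that makes the gradient step match the iteration exactly), a constant stepsize equal to the reciprocal of the squared spectral norm of $\Phi$, and hypothetical function names.

```python
import numpy as np


def soft_threshold(x, mu):
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)


def objective(Phi, y, lam, x):
    """0.5 * ||Phi x - y||^2 + lam * ||x||_1 (assumed scaling of the data term)."""
    return 0.5 * float(np.sum((Phi @ x - y) ** 2)) + lam * float(np.sum(np.abs(x)))


def proximal_gradient_l1(Phi, y, lam, num_iters=500):
    """Gradient step on the data term followed by soft thresholding."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2  # 1 / (largest singular value)^2
    x = np.zeros(Phi.shape[1])
    for _ in range(num_iters):
        x = soft_threshold(x + step * Phi.T @ (y - Phi @ x), step * lam)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 30, 60
    Phi = rng.standard_normal((n, p)) / np.sqrt(n)
    x_true = np.zeros(p)
    x_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
    y = Phi @ x_true
    x_hat = proximal_gradient_l1(Phi, y, lam=0.05)
    print("objective at zero:", objective(Phi, y, 0.05, np.zeros(p)))
    print("objective at x_hat:", objective(Phi, y, 0.05, x_hat))
```

With this stepsize the objective value does not increase from one iteration to the next. Swapping `soft_threshold` for another proximity operator gives the corresponding method for a different atomic norm.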

The basic (noiseless) atomic norm minimization problem (5.5) can be solved by
minimizing a sequence of instances of (5.20) with monotonically decreasing values of
$\lambda$. Each subsequent minimization is initialized from the point returned by the previous
step. Such an approach corresponds to the classic Method of Multipliers [12] and has
proven effective for solving problems regularized by the $\ell_1$ norm and for total variation
denoising [27, 146].

This discussion demonstrates that when the proximity operator associated with
some atomic set A can be easily computed, then efficient first-order algorithms are
immediate. For novel atomic norm applications, one can thus focus on algorithms and
techniques to compute the associated proximity operators. We note that, from a computational
perspective, it may be easier to compute the proximity operator via the dual atomic
norm. Associated to each proximity operator is the dual operator

$$\Lambda_{\mathcal{A}}(x; \mu) = \arg\min_{y}\ \tfrac{1}{2}\|y - x\|^2 \quad \text{s.t.} \quad \|y\|_{\mathcal{A}}^{*} \le \mu. \tag{5.22}$$

By an appropriate change of variables, $\Lambda_{\mathcal{A}}$ is nothing more than the projection of $\mu^{-1}x$
onto the unit ball in the dual atomic norm, rescaled by $\mu$:

$$\Lambda_{\mathcal{A}}(x; \mu) = \mu \cdot \arg\min_{y}\ \tfrac{1}{2}\|y - \mu^{-1}x\|^2 \quad \text{s.t.} \quad \|y\|_{\mathcal{A}}^{*} \le 1. \tag{5.23}$$

From convex programming duality, we have $x = \Pi_{\mathcal{A}}(x; \mu) + \Lambda_{\mathcal{A}}(x; \mu)$. This can be
seen by observing

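$$\begin{aligned}
\min_{z}\ \tfrac{1}{2}\|z - x\|^2 + \mu\|z\|_{\mathcal{A}}
  &= \min_{z}\ \max_{\|y\|_{\mathcal{A}}^{*} \le \mu}\ \tfrac{1}{2}\|z - x\|^2 + \langle y, z \rangle && (5.24) \\
  &= \max_{\|y\|_{\mathcal{A}}^{*} \le \mu}\ \min_{z}\ \tfrac{1}{2}\|z - x\|^2 + \langle y, z \rangle && (5.25) \\
  &= \max_{\|y\|_{\mathcal{A}}^{*} \le \mu}\ -\tfrac{1}{2}\|y - x\|^2 + \tfrac{1}{2}\|x\|^2 && (5.26)
\end{aligned}$$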

In particular, $\Pi_{\mathcal{A}}(x; \mu)$ and $\Lambda_{\mathcal{A}}(x; \mu)$ form a complementary primal-dual pair for this
optimization problem. Hence, we only need to be able to efficiently compute the Euclidean
projection onto the dual norm ball to compute the proximity operator associated with
the atomic norm.
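The primal-dual decomposition can be verified numerically in the $\ell_1$ case, where the dual norm is the $\ell_\infty$ norm and the projection onto its ball of radius $\mu$ is a componentwise clip. The sketch below checks that soft thresholding and clipping sum to the original vector, and that the clip equals $\mu$ times the projection of $\mu^{-1}x$ onto the unit ball. Function names are illustrative.

```python
import numpy as np


def prox_l1(x, mu):
    """Proximity operator of mu * ||.||_1 (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)


def project_linf_ball(x, radius):
    """Euclidean projection onto the l-infinity ball of the given radius."""
    return np.clip(x, -radius, radius)


def prox_via_dual_projection(x, mu):
    """Recover the proximity operator as x minus the dual-ball projection."""
    return x - project_linf_ball(x, mu)


if __name__ == "__main__":
    x = np.array([3.0, -0.5, 0.2, -2.0])
    mu = 1.0
    print(prox_l1(x, mu))
    print(prox_via_dual_projection(x, mu))
```

The same pattern applies to any atomic norm for which the projection onto the dual norm ball is available: subtract the projection from the input to obtain the proximity operator.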

Finally,  though  the  proximity  operator  provides  an  elegant  framework  for  algorithm 
generation,  there  are  many  other  possible  algorithmic  approaches  that  may  be  employed 
to  take  advantage  of  the  particular  structure  of  an  atomic  set  A.  For  instance,  we  can 
rewrite  (5.22)  as 

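$$\Lambda_{\mathcal{A}}(x; \mu) = \mu \cdot \arg\min_{y}\ \tfrac{1}{2}\|y - \mu^{-1}x\|^2 \quad \text{s.t.} \quad \langle y, a \rangle \le 1\ \ \forall a \in \mathcal{A}. \tag{5.27}$$

Suppose we have access to a procedure that, given $z \in \mathbb{R}^p$, can decide whether $\langle z, a \rangle \le 1$
for all $a \in \mathcal{A}$, or can find a violated constraint where $\langle z, a \rangle > 1$. In this case, we can
apply a cutting plane method or ellipsoid method to solve (5.22) or (5.6) [111, 117].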
Similarly,  if  it  is  simpler  to  compute  a  subgradient  of  the  atomic  norm  than  it  is  to 
compute  a  proximity  operator,  then  the  standard  subgradient  method  [13,  111]  can  be 
applied  to  solve  problems  of  the  form  (5.20).  Each  computational  scheme  will  have 
different  advantages  and  drawbacks  for  specific  atomic  sets,  and  relative  effectiveness 
needs  to  be  evaluated  on  a  case-by-case  basis. 
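A separation procedure of the kind described above is easy to write when the linear functional can be maximized over the atoms in closed form. The sketch below does this for the atomic set of sign-vectors, where the maximizing atom for a given z is its sign pattern and the maximum value is the $\ell_1$ norm of z. This is only an illustration of the oracle interface (return nothing when all constraints hold, otherwise return a violated atom); the function name and tolerance are assumptions, and for other atomic sets the maximization step would be replaced by the appropriate combinatorial or spectral routine.

```python
def sign_vector_oracle(z, tol=1e-12):
    """Separation oracle for the constraints <z, a> <= 1 over all sign-vectors a.

    Returns None when every constraint holds, otherwise a sign-vector a
    with <z, a> > 1.
    """
    # The atom maximizing <z, a> matches the sign of each entry of z.
    a = [1.0 if zi >= 0 else -1.0 for zi in z]
    value = sum(zi * ai for zi, ai in zip(z, a))  # equals the l1 norm of z
    if value <= 1.0 + tol:
        return None
    return a


if __name__ == "__main__":
    print(sign_vector_oracle([0.2, -0.3, 0.1]))
    print(sign_vector_oracle([0.8, -0.5]))
```

A cutting plane method would call such an oracle at each candidate point and add the returned atom as a new linear constraint.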

■  5.5.2  Simulation  results 

We  describe  the  results  of  numerical  experiments  in  recovering  orthogonal  matrices, 
permutation  matrices,  and  rank-one  sign  matrices  (i.e.,  cut  matrices)  from  random 
linear  measurements  by  solving  convex  optimization  problems.  All  the  atomic  norm 
minimization  problems  in  these  experiments  are  solved  using  a  combination  of  the 
SDPT3  package  [136]  and  the  YALMIP  parser  [98]. 

Orthogonal matrices. We consider the recovery of 20 × 20 orthogonal matrices
from random Gaussian measurements via spectral norm minimization. Specifically we
solve the convex program (5.5), with the atomic norm being the spectral norm. Figure 5.3
gives a plot of the probability of exact recovery (computed over 50 random
trials) versus the number of measurements required.

Figure 5.3. Plots of the number of measurements available versus the probability of exact recovery
(computed over 50 trials) for various models. Panels: 20x20 permutation matrix, 20x20 cut matrix,
20x20 orthogonal matrix.

Permutation matrices. We consider the recovery of 20 × 20 permutation matrices
from random Gaussian measurements. We solve the convex program (5.5), with the
atomic norm being the norm induced by the Birkhoff polytope of 20 × 20 doubly stochastic
matrices. Figure 5.3 gives a plot of the probability of exact recovery (computed over
50 random trials) versus the number of measurements required.

Cut matrices. We consider the recovery of 20 × 20 cut matrices from random
Gaussian measurements. As the cut polytope is intractable to characterize, we solve the
convex program (5.5) with the atomic norm being approximated by the norm induced
by the semidefinite relaxation $\mathcal{P}_1$ described in Section 5.4.3. Figure 5.3 gives a plot of
the probability of exact recovery (computed over 50 random trials) versus the number
of measurements required.

In each of these experiments we see agreement between the observed phase transitions
and the theoretical predictions (Propositions 5.3.5, 5.3.6, and 5.4.1) of the number
of measurements required for exact recovery. In particular note that the phase transition
in Figure 5.3 for the number of measurements required for recovering an orthogonal
matrix is very close to the prediction $n \approx \frac{3m^2 - m}{4} = 295$ of Proposition 5.3.5. We
refer the reader to [55, 102, 121] for similar phase transition plots for recovering sparse
vectors, low-rank matrices, and signed vectors from random measurements via convex
optimization.
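
As a quick check of the arithmetic, the threshold quoted from Proposition 5.3.5 for m x m orthogonal matrices, taken here to be $(3m^2 - m)/4$, evaluates as follows at m = 20:

```latex
\[
n \approx \frac{3m^2 - m}{4}\bigg|_{m=20}
  = \frac{3 \cdot 400 - 20}{4}
  = \frac{1180}{4}
  = 295 .
\]
```

This is below the ambient dimension $m^2 = 400$ of a 20 x 20 matrix, so the prediction corresponds to a genuinely underdetermined recovery problem.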

■  5.6  Discussion 

This  chapter  has  illustrated  that  for  a  fixed  set  of  base  atoms,  the  atomic  norm  is  the 
best choice of a convex regularizer for solving ill-posed inverse problems with the prescribed
priors. With this in mind, our results in Section 5.3 and Section 5.4 outline
methods for computing hard limits on the number of measurements required for recovery
from any convex heuristic. Using the calculus of Gaussian widths, such bounds
can  be  computed  in  a  relatively  straightforward  fashion,  especially  if  one  can  appeal  to 
notions  of  convex  duality  and  symmetry.  This  computational  machinery  of  widths  and 
dimension  counting  is  surprisingly  powerful:  near-optimal  bounds  on  estimating  sparse 
vectors and low-rank matrices from partial information follow from elementary integration.
Thus we expect that our new bounds concerning symmetric, vertex-transitive
polytopes  are  also  nearly  tight.  Moreover,  algebraic  reasoning  allowed  us  to  explore  the 
inherent  trade-offs  between  computational  efficiency  and  measurement  demands.  More 
complicated  algorithms  for  atomic  norm  regularization  might  extract  structure  from 
less  information,  but  approximation  algorithms  are  often  sufficient  for  near  optimal 
reconstructions. 

This  chapter  serves  as  a  foundation  for  many  new  exciting  directions  in  inverse 
problems,  and  we  close  our  discussion  with  a  description  of  several  natural  possibilities 
for  future  work: 

Width  calculations  for  more  atomic  sets.  The  calculus  of  Gaussian  widths  described  in 
Section  5.3  provides  the  building  blocks  for  computing  the  Gaussian  widths  for  the 
application  examples  discussed  in  Section  5.2.  We  have  not  yet  exhaustively  estimated 
the  widths  in  all  of  these  examples,  and  a  thorough  cataloging  of  the  measurement 
demands  associated  with  different  prior  information  would  provide  a  more  complete 
understanding  of  the  fundamental  limits  of  solving  underdetermined  inverse  problems. 
Moreover,  our  list  of  examples  is  by  no  means  exhaustive.  The  framework  developed  in 
this  chapter  provides  a  compact  and  efficient  methodology  for  constructing  regularizers 
from  very  general  prior  information,  and  new  regularizers  can  be  easily  created  by 
translating  grounded  expert  knowledge  into  new  atomic  norms. 

Atomic norm decompositions. While the techniques of Section 5.3 and Section 5.4 provide
bounds on the estimation of points in low-dimensional secant varieties of atomic
sets,  they  do  not  provide  a  procedure  for  actually  constructing  decompositions.  That 
is,  we  have  provided  bounds  on  the  number  of  measurements  required  to  recover  points 
x  of  the  form 

$$x = \sum_{a \in \mathcal{A}} c_a a$$

when the coefficient sequence $\{c_a\}$ is sparse, but we do not provide any methods for
actually  recovering  c  itself.  These  decompositions  are  useful,  for  instance,  in  actually 
computing the rank-one binary vectors optimized in semidefinite relaxations of combinatorial
algorithms [3,72,110], or in the computation of tensor decompositions from
incomplete data [91]. Is it possible to use algebraic structure to generate deterministic
or randomized algorithms for reconstructing the atoms that underlie a vector x,
especially  when  approximate  norms  are  used? 

Large-scale  algorithms.  Finally,  we  think  that  the  most  fruitful  extensions  of  this  work 
lie  in  a  thorough  exploration  of  the  empirical  performance  and  efficacy  of  atomic  norms 
on  large-scale  inverse  problems.  The  proposed  algorithms  in  Section  5.5  require  only 
the  knowledge  of  the  proximity  operator  of  an  atomic  norm,  or  a  Euclidean  projection 
operator  onto  the  dual  norm  ball.  Using  these  design  principles  and  the  geometry  of 
particular  atomic  norms  should  enable  the  scaling  of  atomic  norm  techniques  to  massive 
data  sets. 

Convex Graph Invariants


■  6.1  Introduction 

Graphs  are  useful  in  many  applications  throughout  science  and  engineering  as  they 
offer  a  concise  model  for  relationships  among  a  large  number  of  interacting  entities. 
These  relationships  are  often  best  understood  using  structural  properties  of  graphs. 
Graph  invariants  play  an  important  role  in  characterizing  abstract  structural  features 
of  a  graph,  as  they  do  not  depend  on  the  labeling  of  the  nodes  of  the  graph.  Indeed 
families  of  graphs  that  share  common  structural  attributes  are  often  specified  via  graph 
invariants.  For  example  bipartite  graphs  can  be  defined  by  the  property  that  they 
contain  no  cycles  of  odd  length,  while  the  family  of  regular  graphs  consists  of  graphs  in 
which  all  nodes  have  the  same  degree.  Such  descriptions  of  classes  of  graphs  in  terms 
of  invariants  have  found  applications  in  areas  as  varied  as  combinatorics  [48],  network 
analysis  in  chemistry  [21]  and  in  biology  [105],  and  in  machine  learning  [93].  For  instance 
the  treewidth  [123]  of  a  graph  is  a  basic  invariant  that  governs  the  complexity  of  various 
algorithms  for  graph  problems. 

We  begin  by  introducing  three  canonical  problems  involving  structural  properties  of 
graphs,  and  the  development  of  a  unified  solution  framework  to  address  these  questions 
serves  as  motivation  for  our  discussion  throughout  this  chapter. 

•  Graph  deconvolution.  Suppose  we  are  given  a  graph  that  is  the  combination  of 
two known graphs overlaid on the same set of nodes. How do we recover the individual
components from the composite graph? For example in Figure 6.1 we are
given a composite graph that is formed by adding a cycle and the Clebsch graph.
Given no extra knowledge of any labeling of the nodes, can we “deconvolve” the
composite graph into the individual cycle/Clebsch graph components?

• Graph generation. Given certain structural constraints specified by invariants,
how do we produce a graph that satisfies these constraints? A well-studied example
is the question of constructing expander graphs. Another example may be that
we wish to recover a graph given constraints, for instance, on certain subgraphs
being  forbidden,  on  the  degree  distribution,  and  on  the  spectral  distribution. 

• Graph hypothesis testing. Suppose we have two families of graphs, each characterized
by some common structural properties specified by a set of invariants;
given  a  new  sample  graph  which  of  the  two  families  offers  a  “better  explanation” 
of  the  sample  graph  (see  Figure  6.2)? 

In  Section  6.2  we  describe  these  problems  in  more  detail,  and  also  give  some  concrete 
applications  in  network  analysis  and  modeling  in  which  such  questions  are  of  interest. 

To  efficiently  solve  problems  such  as  these  we  wish  to  develop  a  collection  of  tractable 
computational  tools.  Convex  relaxation  techniques  offer  a  candidate  framework  as  they 
possess  numerous  favorable  properties.  Due  to  their  powerful  modeling  capabilities, 
convex  optimization  methods  can  provide  tractable  formulations  for  solving  difficult 
combinatorial  problems  exactly  or  approximately.  Further  convex  programs  may  often 
be  solved  effectively  using  general-purpose  off-the-shelf  software.  Finally  one  can  also 
give  conditions  for  the  success  of  these  convex  relaxations  based  on  standard  optimality 
results  from  convex  analysis. 

Motivated  by  these  considerations  we  introduce  and  study  convex  graph  invariants 
in  Section  6.3.  These  invariants  are  convex  functions  of  the  adjacency  matrix  of  a  graph. 
More formally letting A denote the adjacency matrix of a (weighted) graph, a convex
graph invariant is a convex function $f$ such that $f(A) = f(\Pi A \Pi^T)$ for all permutation
matrices $\Pi$. Examples include functions of a graph such as the maximum degree, the
MAXCUT  value  (and  its  semidefinite  relaxation),  the  second  smallest  eigenvalue  of  the 
Laplacian  (a  concave  invariant),  and  spectral  invariants  such  as  the  sum  of  the  k  largest 
eigenvalues;  see  Section  6.3.3  for  a  more  comprehensive  list.  As  some  of  these  invariants 
may  possibly  be  hard  to  compute,  we  discuss  in  the  sequel  the  question  of  approximating 
intractable  convex  invariants.  We  also  study  invariant  convex  sets,  which  are  convex 
sets with the property that a symmetric matrix A is a member of such a set if and
only if $\Pi A \Pi^T$ is also a member of the set for all permutations $\Pi$. Such convex sets
are  useful  in  order  to  impose  various  structural  constraints  on  graphs.  For  example 
invariant  convex  sets  can  be  used  to  express  forbidden  subgraph  constraints  (i.e.,  that 
a  graph  does  not  contain  a  particular  subgraph  such  as  a  triangle),  or  require  that  a 
graph be connected; see Section 6.3.4 for more examples. We compare the strengths and
weaknesses  of  convex  graph  invariants  versus  more  general  non-convex  graph  invariants. 
Finally  we  also  provide  a  robust  optimization  perspective  of  invariant  convex  sets.  In 
particular  we  make  connections  between  our  work  and  the  data-driven  perspective  on 
robust  optimization  studied  in  [14]. 

In  order  to  systematically  evaluate  the  expressive  power  of  convex  graph  invariants 
we  analyze  elementary  convex  graph  invariants,  which  serve  as  a  basis  for  constructing 
arbitrary  convex  invariants.  Given  a  symmetric  matrix  P,  these  elementary  invariants 
(again,  possibly  hard  to  compute  depending  on  the  choice  of  P)  are  defined  as  follows: 

$$\Theta_P(A) = \max_{\Pi \in \mathrm{Sym}(n)} \mathrm{Tr}(P \Pi A \Pi^T), \qquad (6.1)$$

where A represents the adjacency matrix of a graph, and the maximum is taken over all
permutation matrices $\Pi$. It is clear that $\Theta_P$ is a convex graph invariant, because it is
expressed as the maximum over a set of linear functions. Indeed several simple convex
graph invariants can be expressed using functions of the form (6.1). For example P = I
gives us the total sum of the node weights, while $P = \mathbf{1}\mathbf{1}^T - I$ gives us the sum of the
(weighted) node degrees, i.e., twice the total edge weight. Our main theoretical results in Section 6.3 can be summarized as
follows:  First  we  give  a  representation  theorem  stating  that  any  convex  graph  invariant 
can  be  expressed  as  the  supremum  over  elementary  convex  graph  invariants  (6.1)  (see 
Theorem  6.3.1).  Second  we  have  a  similar  result  stating  that  any  invariant  convex  set 
can  be  expressed  as  the  intersection  of  convex  sets  given  by  level  sets  of  the  elementary 
invariants  (6.1)  (see  Proposition  6.3.1).  These  results  follow  as  a  consequence  of  the 
separation theorem from convex analysis. Finally we also show that for any two non-isomorphic
graphs given by adjacency matrices $A_1$ and $A_2$, there exists a P such that
$\Theta_P(A_1) \neq \Theta_P(A_2)$ (see Lemma 6.3.1). Hence convex graph invariants offer a complete
set of invariants as they can distinguish between non-isomorphic graphs.
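
To make the elementary invariants of (6.1) concrete, the following is a minimal sketch that evaluates the maximum of Tr(P Π A Π^T) by brute force over all n! relabelings, which is feasible only for very small n. The function names and the 4-node example graphs are illustrative assumptions and are not taken from the text.

```python
from itertools import permutations

def relabel(A, perm):
    """Return Pi A Pi^T, where the relabeled entry (i, j) equals A[perm[i]][perm[j]]."""
    n = len(A)
    return [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

def trace_of_product(P, B):
    """Compute Tr(P B) for square matrices stored as lists of lists."""
    n = len(P)
    return sum(P[i][j] * B[j][i] for i in range(n) for j in range(n))

def theta(P, A):
    """Brute-force elementary invariant: max over all relabelings of Tr(P Pi A Pi^T)."""
    return max(trace_of_product(P, relabel(A, perm))
               for perm in permutations(range(len(A))))

n = 4
identity = [[int(i == j) for j in range(n)] for i in range(n)]
ones_minus_identity = [[int(i != j) for j in range(n)] for i in range(n)]

# Illustrative unweighted graphs on 4 nodes (assumed examples).
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]

# The same path with node weights 2 and 1 placed on the diagonal.
weighted_path = [row[:] for row in path]
weighted_path[0][0] = 2
weighted_path[3][3] = 1

node_weight_sum = theta(identity, weighted_path)         # trace of A
offdiag_sum = theta(ones_minus_identity, weighted_path)  # sum of off-diagonal entries
path_vs_path = theta(path, path)
path_vs_star = theta(path, star)
path_vs_relabeled_star = theta(path, relabel(star, (2, 0, 3, 1)))
```

In this sketch the path and the star both have three edges, so the choice P = 11^T - I cannot separate them, whereas taking P to be the adjacency matrix of the path gives different values for the two graphs. Relabeling the star leaves the value unchanged, as invariance requires.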

In  Section  6.3.7  we  discuss  an  important  subclass  of  convex  graph  invariants,  namely 
the  set  of  convex  spectral  invariants.  These  are  convex  functions  of  symmetric  matrices 
that  depend  only  on  the  eigenvalues,  and  can  equivalently  be  expressed  as  the  set 
of  convex  functions  of  symmetric  matrices  that  are  invariant  under  conjugation  by 
orthogonal  matrices  (note  that  convex  graph  invariants  are  only  required  to  be  invariant 
with respect to conjugation by permutation matrices) [42]. The properties of convex
spectral  invariants  are  well-understood,  and  they  are  useful  in  a  number  of  practically 
relevant  problems  (e.g.,  characterizing  the  subdifferential  of  a  unitarily  invariant  matrix 
norm [142]). These invariants play a prominent role in our experimental demonstrations
in  Section  6.5. 

As  noted  above  convex  graph  invariants,  and  even  elementary  invariants,  may  in 
general be hard to compute. In Section 6.4 we investigate the question of approximately
computing these invariants in a tractable manner. For many interesting special
cases  such  as  the  MAXCUT  value  of  a  graph,  or  (the  inverse  of)  the  stability  number, 
there  exist  well-known  tractable  semidefinite  programming  (SDP)  relaxations  that  can 
be  used  as  surrogates  instead  [72, 109].  More  generally  functions  of  the  form  of  our 
elementary  convex  invariants  (6.1)  have  appeared  previously  in  the  literature;  see  [32] 
for a survey. Specifically we note that evaluating the function $\Theta_P(A)$ for any fixed A, P
is  equivalent  to  solving  the  so-called  Quadratic  Assignment  Problem  (QAP),  and  thus 
we  can  employ  various  tractable  linear  programming,  spectral,  and  SDP  relaxations 
of  QAP  [32,122,147].  In  particular  we  discuss  recent  work  [43]  on  exploiting  group 
symmetry  in  SDP  relaxations  of  QAP,  which  is  useful  for  approximately  computing 
elementary  convex  graph  invariants  in  many  interesting  cases. 
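
One simple spectral bound of this kind, given here as a sketch and not necessarily the relaxation analyzed in Section 6.4, follows by enlarging the feasible set from permutation matrices to all orthogonal matrices. Writing $\lambda_1(\cdot) \geq \cdots \geq \lambda_n(\cdot)$ for eigenvalues sorted in descending order (notation assumed for this illustration), we have

```latex
\[
\Theta_P(A) \;=\; \max_{\Pi \in \mathrm{Sym}(n)} \mathrm{Tr}(P \Pi A \Pi^T)
\;\leq\; \max_{U \in O(n)} \mathrm{Tr}(P U A U^T)
\;=\; \sum_{i=1}^{n} \lambda_i(P)\, \lambda_i(A).
\]
```

The inequality holds because every permutation matrix is orthogonal, and the final equality is the classical trace inequality for symmetric matrices. The right-hand side requires only the eigenvalues of P and A.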

Finally  in  Section  6.5  we  return  to  the  motivating  problems  described  previously, 
and give solutions to these questions. These solutions are based on convex programming
formulations, with convex graph invariants playing a fundamental role. We give
theoretical conditions for the success of these convex formulations in solving the problems
discussed above, and experimental demonstrations of their effectiveness in practice.
Indeed the framework provided by convex graph invariants allows for a unified investigation
of our proposed solutions. As an example result we give a tractable convex
program  (in  fact  an  SDP)  in  Section  6.5.1  to  “deconvolve”  the  cycle  and  the  Clebsch 
graph  from  a  composite  graph  consisting  of  these  components  (see  Figure  6.1);  a  salient 
feature  of  this  convex  program  is  that  it  only  uses  spectral  invariants  to  perform  the 
decomposition. 

Summary of contributions. We emphasize again the main contributions of this chapter.
We  begin  by  introducing  three  canonical  problems  involving  structural  properties  of 
graphs.  These  problems  arise  in  various  applications  (see  Section  6.2),  and  serve  as 
a  motivation  for  our  discussion  in  this  chapter.  In  order  to  solve  these  problems  we 
introduce  convex  graph  invariants,  and  investigate  their  properties  (see  Section  6.3). 
Specifically  we  provide  a  representation  theorem  of  convex  graph  invariants  in  terms 
of  elementary  invariants,  and  we  make  connections  between  these  ideas  and  concepts 
from  other  areas  such  as  robust  optimization.  Finally  we  describe  tractable  convex 
programming solutions to the motivating problems based on convex graph invariants
(see  Section  6.5).  Therefore,  convex  graph  invariants  provide  a  useful  computational 
framework  based  on  convex  optimization  for  graph  problems. 

Related previous work. We note that convex optimization methods have been used previously
to solve various graph-related problems. We would particularly like to emphasize
a  body  of  work  on  convex  programming  formulations  to  optimize  convex  functions  of 
the  Laplacian  eigenvalues  of  graphs  [22,  23]  subject  to  various  constraints.  Although 
our  objective  is  similar  in  that  we  seek  solutions  based  on  convex  optimization  to  graph 
problems,  our  work  is  different  in  several  respects  from  these  previous  approaches.  While 
the  problems  discussed  in  [22]  explicitly  involved  the  optimization  of  spectral  functions, 
other  graph  problems  such  as  those  described  in  Section  6.2  may  require  non-spectral 
approaches  (for  example,  hypothesis  testing  between  two  families  of  graphs  that  are 
isospectral,  i.e.,  have  the  same  spectrum,  but  are  distinguished  by  other  structural 
properties).  As  convex  spectral  invariants  form  a  subset  of  convex  graph  invariants, 
the  framework  proposed  in  this  chapter  offers  a  larger  suite  of  convex  programming 
methods  for  graph  problems.  More  broadly  our  work  is  the  first  to  formally  introduce 
and  characterize  convex  graph  invariants,  and  to  investigate  their  properties  as  natural 
mathematical  objects  of  independent  interest. 

Outline. In Section 6.2 we give more details of the questions that motivate our study
of  convex  graph  invariants.  Section  6.3  gives  the  definition  of  convex  graph  invariants 
and invariant convex sets, as well as several examples of such functions and
sets.  We  also  discuss  various  properties  of  convex  graph  invariants  in  this  section. 
In  Section  6.4  we  investigate  the  question  of  efficiently  computing  approximations  to 
intractable  convex  graph  invariants.  We  give  detailed  solutions  using  convex  graph 
invariants  to  each  of  our  motivating  problems  in  Section  6.5,  and  we  conclude  with  a 
brief  discussion  in  Section  6.6. 

■  6.2  Applications 

In  this  section  we  describe  three  problems  involving  structural  properties  of  graphs, 
which serve as a motivation for our investigation of convex graph invariants. In Section 6.5
we give solutions to these problems using convex graph invariants.

■ 6.2.1 Graph deconvolution

Suppose  we  are  given  a  graph  that  is  formed  by  overlaying  two  graphs  on  the  same  set 
of  nodes.  More  formally  we  have  a  graph  whose  adjacency  matrix  is  formed  by  adding 
the  adjacency  matrices  of  two  known  graphs.  However,  we  do  not  have  any  information 
about  the  relative  labeling  of  the  nodes  in  the  two  component  graphs.  Can  we  recover 
the  individual  components  from  the  composite  graph?  As  an  example  suppose  we  are 
given  the  combination  of  a  cycle  and  a  grid,  or  a  cycle  and  the  Clebsch  graph,  on 
the  same  set  of  nodes.  Without  any  additional  information  about  the  labeling  of  the 
nodes,  which  may  reveal  the  cycle/grid  or  cycle/Clebsch  graph  structure,  the  goal  is 
to  recover  the  individual  components.  Figure  6.1  gives  a  graphical  illustration  of  this 
question.  In  general  such  decomposition  problems  may  be  ill-posed,  and  it  is  of  interest 
to  give  conditions  under  which  unique  deconvolution  is  possible  as  well  as  to  provide 
tractable  computational  methods  to  recover  the  individual  components.  In  Section  6.5.1 
we  describe  an  approach  based  on  convex  optimization  for  graph  deconvolution;  for 
example  this  method  decomposes  the  cycle  and  the  Clebsch  graph  from  a  composite 
graph  consisting  of  these  components  (see  Figure  6.1)  using  only  the  spectral  properties 
of  the  two  graphs. 
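
The setup above can be sketched in code. The following minimal example only builds a problem instance (it does not solve the deconvolution problem): it forms the composite of a 16-cycle and a relabeled Clebsch graph by adding adjacency matrices. The Clebsch graph is constructed here as the folded 5-cube, which is one standard description; the function names, the fixed random seed, and the hidden relabeling are illustrative assumptions.

```python
import random

def cycle_adjacency(n):
    """Adjacency matrix of the n-cycle with nodes labeled 0, ..., n-1 in cyclic order."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n
        A[i][j] = A[j][i] = 1
    return A

def clebsch_adjacency():
    """Adjacency matrix of the 16-node Clebsch graph, built as the folded 5-cube:
    4-bit labels are adjacent iff they differ in exactly one bit or in all four bits."""
    n = 16
    return [[1 if bin(u ^ v).count("1") in (1, 4) else 0 for v in range(n)]
            for u in range(n)]

def relabel(A, perm):
    """Return Pi A Pi^T, where the relabeled entry (i, j) equals A[perm[i]][perm[j]]."""
    n = len(A)
    return [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

def composite(A1, A2, perm):
    """Overlay A1 and a relabeled copy of A2 by adding their adjacency matrices."""
    B = relabel(A2, perm)
    n = len(A1)
    return [[A1[i][j] + B[i][j] for j in range(n)] for i in range(n)]

rng = random.Random(0)  # fixed seed so the hidden relabeling is reproducible
hidden_perm = list(range(16))
rng.shuffle(hidden_perm)

cycle = cycle_adjacency(16)
clebsch = clebsch_adjacency()
observed = composite(cycle, clebsch, hidden_perm)
```

The deconvolution task is then to recover the two summands from `observed` alone, without access to `hidden_perm`. Every row of `observed` sums to 7 (degree 2 from the cycle plus degree 5 from the Clebsch graph), so node degrees by themselves carry no information about the relabeling.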

Well-known  problems  that  have  the  flavor  of  graph  deconvolution  include  the  planted 
clique  problem,  which  involves  identifying  hidden  cliques  embedded  inside  a  larger 
graph,  and  the  clustering  problem  in  which  the  goal  is  to  decompose  a  large  graph 
into smaller densely connected clusters by removing just a few edges. Convex optimization
approaches for solving such problems have been proposed recently [4, 5]. Graph
deconvolution  more  generally  may  include  other  kinds  of  embedded  structures  beyond 
cliques. 

Applications  of  graph  deconvolution  arise  in  network  analysis  in  which  one  seeks 
to  better  understand  a  complex  network  by  decomposing  it  into  simpler  components. 
Graphs  play  an  important  role  in  modeling,  for  example,  biological  networks  [105]  and 
social networks [59, 83], and lead to natural graph deconvolution problems in these
areas. For instance graphs are useful for describing social exchange networks of interactions
of multiple agents, and graph decompositions are useful for describing the
structure of optimal bargaining solutions in such networks [89]. In a biological network
setting,  transcriptional  regulatory  networks  of  bacteria  have  been  observed  to  consist 
of  small  subgraphs  with  specific  structure  (called  motifs)  that  are  connected  together 
using a “backbone” [49]. Decomposing such regulatory networks into the component
structures is useful for obtaining a better understanding of the high-level properties of
the composite network.

[Figure 6.1 panels: Cycle; Clebsch graph.]

Figure 6.1. An instance of a deconvolution problem: Given a composite graph formed by adding the
16-cycle and the Clebsch graph, we wish to recover the individual components. The Clebsch graph is
an example of a strongly regular graph on 16 nodes [70]; see Section 6.5.1 for more details about the
properties of such graphs.

■  6.2.2  Generating  graphs  with  desired  structural  properties 

Suppose  we  wish  to  construct  a  graph  with  certain  prescribed  structural  constraints.  A 
very  simple  example  may  be  the  problem  of  constructing  a  graph  in  which  each  node 
has  degree  equal  to  two.  A  graph  given  by  a  single  cycle  satisfies  this  constraint.  A 
less  trivial  problem  is  one  in  which  the  objective  may  be  to  build  a  connected  graph 
with  constraints  on  the  spectrum  of  the  adjacency  matrix,  the  degree  distribution, 
and  the  additional  requirements  that  the  graph  be  triangle-free  and  square-free.  Of 
course  such  graph  reconstruction  problems  may  be  infeasible  in  general,  as  there  may 
be  no  graph  consistent  with  the  given  constraints.  Therefore  it  is  of  interest  to  derive 
suitable  conditions  under  which  this  problem  may  be  well-posed,  and  to  develop  a 
suitably  flexible  yet  tractable  computational  framework  to  incorporate  any  structural 
information  available  about  a  graph. 

A prominent instance of a graph construction problem that has received much attention
is the question of generating expander graphs [81]. Expanders are, roughly
speaking, sparse graphs that are well-connected, and they have found applications
in numerous areas of computer science. Methods used to construct expanders range
from random sampling approaches to deterministic constructions based on Ramanujan
graphs [81]. In Section 6.5.2 we describe an approach based on convex optimization to
generate sparse, weighted graphs with small degree and large spectral gap.

Figure 6.2. An instance of a hypothesis testing problem: We wish to decide which family of graphs
offers a “better explanation” for a given candidate sample graph.

■  6.2.3  Graph  hypothesis  testing 

As  our  third  problem  we  consider  a  more  statistically  motivated  question.  Suppose  we 
have  two  families  of  graphs  each  characterized  by  some  common  structural  properties 
specified  by  certain  invariants.  Given  a  new  sample  graph  which  of  these  two  families 
offers a “better explanation” for the sample graph? For example as illustrated in Figure 6.2
we may have two families of graphs, one being the collection of cycles, and
the other being the set of sparse, well-connected graphs. If a new sample graph is a
path (i.e., a cycle with an edge removed), we would expect that the family of cycles
should be a better explanation. On the other hand if the sample is a cycle plus some
edges connecting diametrically opposite nodes, then the second family of sparse, well-connected
graphs offers a more plausible fit. Notice that these classes of graphs may
often  be  specified  in  terms  of  different  sets  of  invariants,  and  it  is  of  interest  to  develop  a 
suitable  framework  in  which  we  can  incorporate  diverse  structural  information  provided 
about  graph  families. 

We  differentiate  this  problem  from  the  well-studied  question  of  testing  properties  of 
graphs  [73].  Examples  of  property  testing  include  testing  whether  a  graph  is  3-colorable, 
or  whether  it  is  close  to  being  bipartite.  An  important  goal  in  property  testing  is  that 
one  wishes  to  test  for  graph  properties  by  only  making  a  small  number  of  “queries”  of 
a graph. We do not explicitly seek such an objective in our algorithms for hypothesis
testing.  We  also  note  that  hypothesis  testing  can  be  posed  more  generally  than  a  yes/no 
question  as  in  property  testing,  and  as  mentioned  above  the  two  families  in  hypothesis 
testing  may  be  specified  in  terms  of  very  different  sets  of  invariants. 

In  order  to  address  the  hypothesis  testing  question  in  a  statistical  framework,  we 
would  need  a  statistical  theory  for  graphs  and  appropriate  error  metrics  with  respect 
to which one could devise optimal decision rules. In Section 6.5.3 we discuss a computational
approach to the hypothesis testing problem using convex graph invariants
that  gives  good  empirical  performance,  and  we  defer  the  issue  of  developing  a  formal 
statistical  framework  to  future  work. 

■  6.3  Convex  Graph  Invariants 

In this section we define convex graph invariants, and discuss their properties. Throughout
this chapter we denote as before the space of n x n symmetric matrices by
$\mathbb{S}^n \cong \mathbb{R}^{\binom{n+1}{2}}$. All our definitions of convexity are with respect to the space $\mathbb{S}^n$. We consider
undirected graphs that have no multiple edges and no self-loops; these are represented
by adjacency matrices that lie in $\mathbb{S}^n$. Therefore a graph may possibly have node
weights and edge weights. A graph is said to be unweighted if its node weights are
zero, and if each edge has a weight of one (non-edges have a weight of zero); otherwise
a graph is said to be weighted. Let $e_i \in \mathbb{R}^n$ denote the vector with a one in the i’th
entry and zero elsewhere, let $I$ denote the n x n identity matrix, let $\mathbf{1} \in \mathbb{R}^n$ denote
the all-ones vector, and let $J = \mathbf{1}\mathbf{1}^T \in \mathbb{S}^n$ denote the all-ones matrix. Further we let
$\mathcal{A} = \{A : A \in \mathbb{S}^n, \ 0 \leq A_{i,j} \leq 1 \ \forall i, j\}$; we will sometimes find it useful in our examples
in Section 6.3.4 to restrict our attention to graphs with adjacency matrices in $\mathcal{A}$. Next
let Sym(n) denote the symmetric group over n elements, i.e., the group of permutations
of n elements. Elements of this group are represented by n x n permutation matrices.
Let O(n) represent the orthogonal group of n x n orthogonal matrices. Finally given a
vector $x \in \mathbb{R}^n$ we recall that $\bar{x}$ denotes the vector obtained by sorting the entries of x
in descending order.

■  6.3.1  Motivation:  Graphs  and  adjacency  matrices 

Matrix  representations  of  graphs  in  terms  of  adjacency  matrices  and  Laplacians  have 
been  used  widely  both  in  applications  as  well  as  in  the  analysis  of  the  structure  of  graphs 
based on algebraic properties of these matrices [17]. For example the spectrum of the
Laplacian  of  a  graph  reveals  whether  a  graph  is  “diffusive”  [81],  or  whether  it  is  even 
connected.  The  degree  sequence,  which  may  be  obtained  from  the  adjacency  matrix  or 
the  Laplacian,  reveals  whether  a  graph  is  regular,  and  it  plays  a  role  in  a  number  of 
real-world  investigations  of  graphs  arising  in  social  networks  and  the  Internet. 

Given a graph $\mathcal{G}$ defined on n nodes, a labeling of the nodes of $\mathcal{G}$ is a function $\ell$ that
maps the nodes of $\mathcal{G}$ onto distinct integers in {1, ..., n}. An adjacency matrix $A \in \mathbb{S}^n$
is then said to represent or specify $\mathcal{G}$ if there exists a labeling $\ell$ of the nodes of $\mathcal{G}$ so that
the weight of the edge between nodes i and j equals $A_{\ell(i),\ell(j)}$ for all pairs {i,j} and the
weight of node i equals $A_{\ell(i),\ell(i)}$ for all i. However an adjacency matrix representation
A of the graph $\mathcal{G}$ is not unique. In particular $\Pi A \Pi^T$ also specifies $\mathcal{G}$ for all $\Pi \in \mathrm{Sym}(n)$.
All these alternative adjacency matrices correspond to different labelings of the nodes of
$\mathcal{G}$. Thus the graph $\mathcal{G}$ is specified by the matrix A only up to a relabeling of the indices of
A. Our objective is to describe abstract structural properties of $\mathcal{G}$ that do not depend
on  a  choice  of  labeling  of  the  nodes.  In  order  to  characterize  such  unlabeled  graphs 
in  which  the  nodes  have  no  distinct  identity  except  through  their  connections  to  other 
nodes,  it  is  important  that  any  function  of  an  adjacency  matrix  representation  of  a 
graph  not  depend  on  the  particular  choice  of  indices  of  A.  Therefore  we  seek  functions 
of  adjacency  matrices  that  are  invariant  under  conjugation  by  permutation  matrices, 
and  denote  such  functions  as  graph  invariants. 
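
As a small illustration of why invariance under relabeling matters, the following sketch relabels a 3-node path and compares a labeling-dependent quantity (the degrees in node order) with a labeling-independent one (the degrees sorted in descending order). The helper names and the example graph are assumptions made for this illustration.

```python
def relabel(A, perm):
    """Return Pi A Pi^T, where the relabeled entry (i, j) equals A[perm[i]][perm[j]]."""
    n = len(A)
    return [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

def degrees(A):
    """Degree of each node in label order: the sum of off-diagonal entries in its row."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def sorted_degrees(A):
    """Degree sequence sorted in descending order, which does not depend on the labeling."""
    return sorted(degrees(A), reverse=True)

# Path on three nodes with the middle node labeled 1 (illustrative example).
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]

# Swap the labels of the first two nodes.
relabeled = relabel(path, (1, 0, 2))
```

The two matrices `path` and `relabeled` differ entrywise, and so do their degree lists in label order, yet both specify the same unlabeled graph. The sorted degree sequence agrees for the two, which is the behavior required of a graph invariant.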

■  6.3.2  Definition  of  convex  invariants 

A  convex  graph  invariant  is  an  invariant  that  is  a  convex  function  of  the  adjacency 
matrix  of  a  graph.  Specifically  we  have  the  following  definition: 

Definition 6.3.1. A function $f : \mathbb{S}^n \to \mathbb{R}$ is a convex graph invariant if it is convex,
and if for any $A \in \mathbb{S}^n$ it holds that $f(\Pi A \Pi^T) = f(A)$ for all permutation matrices
$\Pi \in \mathrm{Sym}(n)$.

Thus  convex  graph  invariants  are  convex  functions  that  are  constant  over  orbits  of 
the  symmetric  group  acting  on  symmetric  matrices  by  conjugation.  As  described  above 
the  motivation  behind  the  invariance  property  is  clear.  The  motivation  behind  the 
convexity  property  is  that  we  wish  to  construct  solutions  based  on  convex  programming 
formulations  in  order  to  solve  problems  such  as  those  listed  in  Section  6.2.  We  present 
several examples of convex graph invariants in Section 6.3.3. We note that a concave
graph invariant is a real-valued function over $\mathbb{S}^n$ that is the negative of a convex graph
invariant.

We  also  consider  invariant  convex  sets,  which  are  defined  in  an  analogous  manner 
to  convex  graph  invariants: 

Definition 6.3.2. A set $C \subseteq \mathbb{S}^n$ is said to be an invariant convex set if it is convex and
if for any $A \in C$ it is the case that $\Pi A \Pi^T \in C$ for all permutation matrices $\Pi \in \mathrm{Sym}(n)$.

In  Section  6.3.4  we  present  examples  in  which  graphs  can  be  constrained  to  have 
various  properties  by  requiring  that  adjacency  matrices  belong  to  such  convex  invariant 
sets.  We  also  make  connections  between  robust  optimization  and  invariant  convex  sets 
in  Section  6.3.6. 

In order to systematically study convex graph invariants, we analyze certain elementary
invariants that serve as a basis for constructing arbitrary convex invariants.
These elementary invariants are defined as follows:

Definition 6.3.3. An elementary convex graph invariant is a function $\Theta_P : \mathbb{S}^n \to \mathbb{R}$
of the form

$$\Theta_P(A) = \max_{\Pi \in \mathrm{Sym}(n)} \mathrm{Tr}(P \Pi A \Pi^T),$$

for any $P \in \mathbb{S}^n$.
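As a rough illustration of Definition 6.3.3, the following sketch evaluates $\Theta_P(A)$ by brute force over all $n!$ permutations. The function name `theta` and the example matrices are assumptions made for illustration; the enumeration is only feasible for very small $n$ and is not a method proposed in the text.

```python
from itertools import permutations

def theta(P, A):
    """Brute-force Theta_P(A) = max over permutations of Tr(P Pi A Pi^T).

    For symmetric P and B, Tr(P B) = sum_ij P[i][j] * B[i][j], and
    B = Pi A Pi^T is given entrywise by B[i][j] = A[perm[i]][perm[j]].
    """
    n = len(A)
    return max(
        sum(P[i][j] * A[perm[i]][perm[j]] for i in range(n) for j in range(n))
        for perm in permutations(range(n))
    )

# P = e_1 e_1^T picks out one diagonal entry (hypothetical example data).
P = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
A = [[0.5, 1.0, 0.0], [1.0, 2.0, 3.0], [0.0, 3.0, -1.0]]
order = (2, 0, 1)
B = [[A[p][q] for q in order] for p in order]            # a relabeling of A
M = [[(A[i][j] + B[i][j]) / 2 for j in range(3)] for i in range(3)]

value_A = theta(P, A)   # largest diagonal entry of A
value_B = theta(P, B)   # same value: invariance under relabeling
value_M = theta(P, M)   # no larger than the average of the two: convexity
```

The midpoint check is only a single instance of the convexity inequality, not a proof of it.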

It  is  clear  that  an  elementary  invariant  is  also  a  convex  graph  invariant,  as  it  is 
expressed  as  the  maximum  over  a  set  of  convex  functions  (in  fact  linear  functions).  We 
describe various properties of convex graph invariants in Section 6.3.5. One useful
construction that we give is the expression of arbitrary convex graph invariants as suprema
over  elementary  invariants.  We  also  discuss  convex  spectral  invariants  in  Section  6.3.7, 
which  are  convex  functions  of  a  symmetric  matrix  that  depend  purely  on  its  spectrum. 
Finally  an  important  point  is  that  convex  graph  invariants  may  in  general  be  hard  to 
compute.  In  Section  6.4  we  discuss  this  problem  and  propose  further  tractable  convex 
relaxations  for  cases  in  which  a  convex  graph  invariant  may  be  intractable  to  compute. 

In the Appendix we describe convex functions defined on $\mathbb{R}^n$ that are invariant
with respect to any permutation of the argument. Such functions have been analyzed
previously, and we provide a list of their well-known properties. We contrast these
properties with those of convex graph invariants throughout the rest of this section.

■ 6.3.3 Examples of convex graph invariants

We list several examples of convex graph invariants. As mentioned previously, some
of these invariants may be difficult to compute, but we defer discussion of
computational issues to Section 6.4. A useful property that we exploit in several of
these examples is that a function defined as the supremum over a set of convex functions
is itself convex [124].

Number of edges. The total number of edges (or sum of edge weights) is an
elementary convex graph invariant with $P = \frac{1}{2}(\mathbf{1}\mathbf{1}^T - I)$.

Node weight. The maximum node weight of a graph, which corresponds to the
maximum diagonal entry of the adjacency matrix of the graph, is an elementary convex
graph invariant with $P = e_1 e_1^T$. The maximum diagonal entry in magnitude of an
adjacency matrix is a convex graph invariant, and can be expressed as follows with
$P = e_1 e_1^T$:

$$\text{max. node weight}(A) = \max\{\Theta_P(A), \Theta_{-P}(A)\}.$$

Similarly the sum of all the node weights, which is the sum of the diagonal entries of an
adjacency matrix of a graph, can be expressed as an elementary convex graph invariant
with $P$ being the identity matrix.

Maximum degree. The maximum (weighted) degree of a node of a graph is also
an elementary convex graph invariant with $P_{1,i} = P_{i,1} = 1$, $\forall i \neq 1$, and all the other
entries of $P$ set to zero.
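The three examples above can be checked numerically on a small graph. The sketch below is illustrative only: the brute-force helper `theta`, the matrix names, and the four-node weighted graph are assumptions, not part of the text. Note that with the matrix $P$ given for the maximum degree, the trace counts every edge incident to the selected node twice (once in the row and once in the column), so the sketch reports twice the maximum degree.

```python
from itertools import permutations

def theta(P, A):
    """Brute-force Theta_P(A) = max over permutations of sum_ij P[i][j] * A[s(i)][s(j)]."""
    n = len(A)
    return max(
        sum(P[i][j] * A[perm[i]][perm[j]] for i in range(n) for j in range(n))
        for perm in permutations(range(n))
    )

n = 4
# Weighted graph (hypothetical data); diagonal entries are node weights.
A = [[0.0, 1.0, 2.0, 0.0],
     [1.0, -3.0, 0.0, 0.0],
     [2.0, 0.0, 1.0, 4.0],
     [0.0, 0.0, 4.0, 0.0]]

# P = (1/2)(1 1^T - I): total edge weight.
P_edges = [[0.0 if i == j else 0.5 for j in range(n)] for i in range(n)]
# P = e_1 e_1^T and its negative: maximum node weight, in value and in magnitude.
P_node = [[1.0 if i == j == 0 else 0.0 for j in range(n)] for i in range(n)]
P_neg = [[-x for x in row] for row in P_node]
# P_{1,i} = P_{i,1} = 1 for i != 1: first row and column, excluding the diagonal.
P_deg = [[1.0 if (i == 0) != (j == 0) else 0.0 for j in range(n)] for i in range(n)]

total_edge_weight = theta(P_edges, A)                       # 1 + 2 + 4
max_node_weight = theta(P_node, A)                          # largest diagonal entry
max_abs_node_weight = max(theta(P_node, A), theta(P_neg, A))
twice_max_degree = theta(P_deg, A)                          # node 2 has degree 2 + 4
```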

Largest cut. The value of the largest weighted cut of a graph specified by an
adjacency matrix $A \in \mathbb{S}^n$ can be written as follows:

$$\text{max. cut}(A) = \max_{y \in \{-1,+1\}^n} \frac{1}{4} \sum_{i,j} A_{i,j}(1 - y_i y_j).$$

As this function is a maximum over a set of linear functions, it is a convex function
of $A$. Further it is also clear that $\text{max. cut}(A) = \text{max. cut}(\Pi A \Pi^T)$ for all permutation
matrices $\Pi$. Consequently the value of the largest cut of a graph is a convex graph
invariant. We note here that computing this invariant is intractable in general. In
practice one could instead employ the following well-known tractable SDP relaxation
[72], which is related to the MAXCUT value by an appropriate shift and rescaling:

$$f(A) = \min \ \mathrm{Tr}(XA) \quad \text{s.t.} \quad X_{ii} = 1, \ \forall i, \quad X \succeq 0. \qquad (6.2)$$

As this relaxation is expressed as the minimum over a set of linear functions, it is a
concave graph invariant. In Section 6.4.2 we discuss in greater detail tractable relaxations
for invariants that are difficult to compute.
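For very small graphs the largest-cut formula can be evaluated directly by enumerating all sign vectors. The following is a minimal sketch under that assumption; the function name `max_cut` and the two example graphs are illustrative, and the SDP relaxation (6.2) is not implemented here.

```python
from itertools import product

def max_cut(A):
    """Brute-force max over y in {-1,+1}^n of (1/4) * sum_ij A[i][j] * (1 - y_i * y_j)."""
    n = len(A)
    return max(
        0.25 * sum(A[i][j] * (1 - y[i] * y[j]) for i in range(n) for j in range(n))
        for y in product((-1, 1), repeat=n)
    )

# Unweighted triangle and unweighted 4-cycle (hypothetical example data).
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
cycle4 = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]

# Relabeling the 4-cycle does not change the value of the largest cut.
order = (2, 0, 3, 1)
cycle4_relabeled = [[cycle4[p][q] for q in order] for p in order]

cut_triangle = max_cut(triangle)          # at most two of three edges can be cut
cut_cycle4 = max_cut(cycle4)              # bipartite: all four edges are cut
cut_relabeled = max_cut(cycle4_relabeled)
```

The factor $\frac{1}{4}$ accounts for each cut edge appearing twice in the double sum, with $1 - y_i y_j = 2$ on each occurrence.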

Isoperimetric number (Cheeger constant). The isoperimetric number, also
known as the Cheeger constant [50], of a graph specified by adjacency matrix $A \in \mathbb{S}^n$
is defined as follows:

$$\text{isoperimetric number}(A) = \min_{U \subset \{1,\dots,n\},\ |U| \leq \frac{n}{2},\ y \in \mathbb{R}^n,\ y_U = \mathbf{1},\ y_{U^c} = -\mathbf{1}} \ \sum_{i,j} \frac{A_{i,j}(1 - y_i y_j)}{4|U|}.$$

Here $U^c = \{1, \dots, n\} \setminus U$ denotes the complement of the set $U$, and $y_U$ is the subset of
the entries of the vector $y$ indexed by $U$. As with the last example, it is again clear
that this function is a concave graph invariant as it is expressed as the minimum over
a set of linear functions. In particular it can be viewed as measuring the value of a
“normalized” cut, and plays an important role in several aspects of graph theory [81].

Degree sequence invariants. Given a graph specified by adjacency matrix $A$
(assume for simplicity that the node weights are zero), the weighted degree sequence is
given by the vector $d(A)$ obtained by sorting the entries of $A\mathbf{1}$
in descending order. It is easily seen that $d(A)$ is a graph invariant. Consequently any
function of $d(A)$ is also a graph invariant. However our interest is in obtaining convex
functions of the adjacency matrix $A$. An important class of functions of $d(A)$ that are
convex functions of $A$, and therefore are convex graph invariants, are of the form:

$$f(A) = v^T d(A),$$

for $v \in \mathbb{R}^n$ such that $v_1 \geq \cdots \geq v_n$. This function can also be expressed as the
maximum over all permutations $\Pi \in \mathrm{Sym}(n)$ of the inner product $v^T \Pi A \mathbf{1}$. As described
in the Appendix such linear monotone functionals can be used to express all convex
functions over $\mathbb{R}^n$ that are invariant with respect to permutations of the argument.
Consequently these monotone functions serve as building blocks for constructing all
convex graph invariants that are functions of $d(A)$.

Spectral invariants. Let the eigenvalues of the adjacency matrix $A$ of a graph be
denoted as $\lambda_1(A) \geq \cdots \geq \lambda_n(A)$, and let $\lambda(A) = [\lambda_1(A), \dots, \lambda_n(A)]$. These eigenvalues
form the spectrum of the graph specified by $A$, and clearly remain unchanged under
transformations of the form $A \to V A V^T$ for any orthogonal matrix $V \in O(n)$ (and
therefore for any permutation matrix). Hence any function of the spectrum of a graph
is a graph invariant. Analogous to the previous example, an important class of spectral
functions that are also convex are of the form:

$$f(A) = v^T \lambda(A),$$

for $v \in \mathbb{R}^n$ such that $v_1 \geq \cdots \geq v_n$. We denote spectral invariants that are also
convex functions as convex spectral invariants. As with convex invariants of the degree
sequence, all convex spectral invariants can be constructed using monotone functions
of the type described here (see the Appendix).
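The two monotone functionals $v^T d(A)$ and $v^T \lambda(A)$ can be compared numerically on a small example. The sketch below assumes NumPy is available; the function names and the weighted star graph are illustrative choices, not taken from the text. It checks that the sorted form of the degree functional agrees with the maximum over permutations, and spot-checks invariance and one instance of the convexity inequality.

```python
import numpy as np
from itertools import permutations

def degree_functional(v, A):
    """v^T d(A), where d(A) is the vector A1 sorted in descending order."""
    d = np.sort(A.sum(axis=1))[::-1]
    return float(v @ d)

def degree_functional_max_form(v, A):
    """Brute-force max over permutations Pi of v^T Pi A 1 (n! terms)."""
    r = A.sum(axis=1)
    return max(float(v @ r[list(p)]) for p in permutations(range(len(r))))

def spectral_functional(v, A):
    """v^T lambda(A), with eigenvalues sorted in descending order."""
    eigs = np.linalg.eigvalsh(A)[::-1]     # eigvalsh returns ascending order
    return float(v @ eigs)

v = np.array([3.0, 2.0, 1.0, 0.0])         # v_1 >= ... >= v_n
# Weighted star with zero node weights (hypothetical example data).
A = np.array([[0, 1, 2, 3],
              [1, 0, 0, 0],
              [2, 0, 0, 0],
              [3, 0, 0, 0]], dtype=float)
order = [2, 0, 3, 1]
B = A[np.ix_(order, order)]                # a relabeling Pi A Pi^T
M = (A + B) / 2                            # midpoint of two relabelings

deg_sorted = degree_functional(v, A)
deg_max = degree_functional_max_form(v, A)
spec_gap = abs(spectral_functional(v, B) - spectral_functional(v, A))
```

Since $f(B) = f(A)$ for a relabeling $B$, convexity implies that both functionals evaluated at the midpoint `M` are no larger than their value at `A`.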

Second-smallest eigenvalue of Laplacian. This example is only meaningful
for weighted graphs in which the node and edge weights are non-negative. For such a
graph specified by adjacency matrix $A$, let $D_A = \mathrm{diag}(A\mathbf{1})$, where diag takes as input a
vector and forms a diagonal matrix with the entries of the vector on the diagonal. The
Laplacian of a graph is then defined as follows:

$$L_A = D_A - A.$$

If $A \in \mathbb{S}^n$ consists of nonnegative entries, then $L_A \succeq 0$. In this setting we denote the
eigenvalues of $L_A$ as $\lambda_1(L_A) \geq \cdots \geq \lambda_n(L_A)$. It is easily seen that $\lambda_n(L_A) = 0$ as the
all-ones vector $\mathbf{1}$ lies in the kernel of $L_A$. The second-smallest eigenvalue $\lambda_{n-1}(L_A)$ of
the Laplacian is a concave invariant function of $A$. It plays an important role as the
graph specified by $A$ is connected if and only if $\lambda_{n-1}(L_A) > 0$.
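The connectivity criterion above is easy to try out numerically. The following is a minimal sketch assuming NumPy; the function name `algebraic_connectivity` and the two three-node graphs are illustrative and not part of the text.

```python
import numpy as np

def algebraic_connectivity(A):
    """Second-smallest eigenvalue of the Laplacian L_A = diag(A 1) - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return float(np.linalg.eigvalsh(L)[1])   # eigvalsh sorts eigenvalues ascending

# Hypothetical example data: a connected path and a graph with an isolated node.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
split = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]

connected_value = algebraic_connectivity(path)      # strictly positive
disconnected_value = algebraic_connectivity(split)  # zero up to rounding
```

For the path on three nodes the Laplacian eigenvalues are 0, 1 and 3, so the second-smallest eigenvalue is 1; for the disconnected graph the eigenvalue 0 has multiplicity two.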

Inverse of Stability Number. A stable set of an unweighted graph $\mathcal{G}$ is a subset
of the nodes of $\mathcal{G}$ such that no two nodes in the subset are adjacent. The stability
number is the size of the largest stable set of $\mathcal{G}$, and is denoted by $\alpha(\mathcal{G})$. By a result of
Motzkin and Straus [109], the inverse of the stability number can be written as follows:

$$\frac{1}{\alpha(\mathcal{G})} = \min_{x} \ x^T (I + A) x \quad \text{s.t.} \quad x_i \geq 0, \ \forall i, \quad \sum_i x_i = 1. \qquad (6.3)$$

Here $A$ is any adjacency matrix representing the graph $\mathcal{G}$. Although this formulation
is for unweighted graphs with edge weights being either one or zero, we note that
the definition can in fact be extended to all weighted graphs, i.e., for graphs with
adjacency matrix given by any $A \in \mathbb{S}^n$. Consequently, the inverse of this extended
stability number of a graph is a concave graph invariant over $\mathbb{S}^n$ as it is expressed as
the minimum over a set of linear functions. As this function is difficult to compute in
general (because the stability number of a graph is intractable to compute), one could
employ the following tractable relaxation:

$$f(A) = \min_{X \in \mathbb{S}^n} \ \mathrm{Tr}(X(I + A)) \quad \text{s.t.} \quad X \geq 0, \quad X \succeq 0, \quad \mathbf{1}^T X \mathbf{1} = 1.$$

This relaxation is also a concave graph invariant as it is expressed as the minimum over
a set of affine functions.
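One direction of the Motzkin-Straus formula (6.3) is simple to verify by hand or in code: placing uniform weight on a largest stable set gives objective value exactly $1/\alpha(\mathcal{G})$. The sketch below illustrates this on a 5-cycle. The function names and the example graph are assumptions for illustration, and the stability number is found by exhaustive search, which is only viable for tiny graphs. The sketch does not solve the minimization itself.

```python
from itertools import combinations

def stability_number(A):
    """Brute-force search for a largest stable set of an unweighted graph."""
    n = len(A)
    for size in range(n, 0, -1):
        for S in combinations(range(n), size):
            if all(A[i][j] == 0 for i, j in combinations(S, 2)):
                return size, S
    return 0, ()

def motzkin_straus_objective(A, x):
    """Evaluate x^T (I + A) x at a point x of the probability simplex."""
    n = len(A)
    return sum(x[i] * ((1 if i == j else 0) + A[i][j]) * x[j]
               for i in range(n) for j in range(n))

# Unweighted 5-cycle (hypothetical example data); its stability number is 2.
C5 = [[1 if abs(i - j) in (1, 4) else 0 for j in range(5)] for i in range(5)]

alpha, S = stability_number(C5)
x_star = [1 / alpha if i in S else 0.0 for i in range(5)]   # uniform weight on S
value_at_stable_set = motzkin_straus_objective(C5, x_star)  # equals 1 / alpha
value_at_uniform = motzkin_straus_objective(C5, [0.2] * 5)  # a feasible, non-optimal point
```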

■  6.3.4  Examples  of  invariant  convex  sets 

Next we provide examples of invariant convex sets. As described below, constraints
expressed using such sets are useful in order to require that graphs have certain
properties. Note that a sublevel set $\{A : f(A) \leq \alpha\}$ for any convex graph invariant $f$ is an
invariant convex set. Therefore all the examples of convex graph invariants given above
can be used to construct invariant convex set constraints.

Algebraic connectivity and diffusion. As mentioned in Section 6.3.3, a graph
represented by adjacency matrix $A \in \mathcal{A}$ has the property that the second-smallest
eigenvalue $\lambda_{n-1}(L_A)$ of the Laplacian of the graph is a concave graph invariant. The
constraint set $\{A : A \in \mathcal{A}, \lambda_{n-1}(L_A) \geq \epsilon\}$ for any $\epsilon > 0$ expresses the property that a
graph must be connected. Further if we set $\epsilon$ to be relatively large, we can require that
a graph has good diffusion properties.

Largest clique constraint. Let $K_k \in \mathbb{S}^n$ denote the adjacency matrix of an
unweighted $k$-clique. Note that $K_k$ is only nonzero within a $k \times k$ submatrix, and is
zero-padded to lie in $\mathbb{S}^n$. Consider the following invariant convex set for $\epsilon > 0$:

$$\{A : A \in \mathcal{A}, \ \Theta_{K_k}(A) \leq (k^2 - k) - \epsilon\}.$$

This constraint set expresses the property that a graph cannot have a clique of size
$k$ (or larger), with the edge weights of all edges in the clique being close to one. For
example we can use this constraint set to require that a graph has no triangles (with
large edge weights). It is important to note that triangles (and cliques more generally)
are forbidden only with the qualification that all the edge weights in the triangle cannot
be close to one. For example a graph may contain a triangle with each edge having
weight equal to $\frac{1}{2}$. In this case the function $\Theta_{K_3}$ evaluates to 3, which is much smaller
than the maximum value of 6 that $\Theta_{K_3}$ can take for matrices in $\mathcal{A}$ that contain a
triangle with edge weights equal to one.
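The triangle example can be reproduced with a brute-force evaluation of $\Theta_{K_3}$. This is a sketch under stated assumptions: the helpers `theta`, `clique` and `weighted_triangle`, the threshold $\epsilon = 1$, and the test graphs are all illustrative choices rather than anything specified in the text.

```python
from itertools import permutations

def theta(P, A):
    """Brute-force Theta_P(A) = max over permutations of sum_ij P[i][j] * A[s(i)][s(j)]."""
    n = len(A)
    return max(
        sum(P[i][j] * A[perm[i]][perm[j]] for i in range(n) for j in range(n))
        for perm in permutations(range(n))
    )

def clique(k, n):
    """Adjacency matrix of an unweighted k-clique, zero-padded to n x n."""
    return [[1 if i != j and i < k and j < k else 0 for j in range(n)] for i in range(n)]

def weighted_triangle(w, n=4):
    """Graph on n nodes whose only edges form a triangle on nodes 1, 2, 3 with weight w."""
    A = [[0.0] * n for _ in range(n)]
    for i, j in ((1, 2), (1, 3), (2, 3)):
        A[i][j] = A[j][i] = w
    return A

n, k, eps = 4, 3, 1.0                     # eps is an arbitrary illustrative threshold
K3 = clique(k, n)
cycle4 = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]

full = theta(K3, weighted_triangle(1.0))  # triangle with unit weights
half = theta(K3, weighted_triangle(0.5))  # triangle with weights 1/2
free = theta(K3, cycle4)                  # triangle-free graph: at most two edges overlap

def satisfies_constraint(value):
    """Membership test for the set {A : Theta_{K_k}(A) <= (k^2 - k) - eps}."""
    return value <= (k * k - k) - eps
```

With these choices the unit-weight triangle violates the constraint, while the half-weight triangle and the 4-cycle satisfy it, matching the qualification made in the text.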

Girth constraint. The girth of a graph is the length of the shortest cycle. Let
$C_k \in \mathbb{S}^n$ denote the adjacency matrix of an unweighted $k$-cycle for $k \leq n$. As with the
$k$-clique, note that $C_k$ is nonzero only within a $k \times k$ submatrix, and is zero-padded so
that it lies in $\mathbb{S}^n$. In order to express the property that a graph has no small cycles,
consider the following invariant convex set for $\epsilon > 0$:

$$\{A : A \in \mathcal{A}, \ \Theta_{C_k}(A) \leq 2k - \epsilon \ \ \forall k \leq k_0\}.$$

Graphs belonging to this set cannot have cycles of length less than or equal to $k_0$, with
the weights of edges in the cycle being close to one. Thus we can impose a lower bound
on a weighted version of the girth of a graph.

Forbidden subgraph constraint. The previous two examples can be viewed as
special cases of a more general constraint involving forbidden subgraphs. Specifically
let $A_k$ denote the adjacency matrix of an unweighted graph on $k$ nodes that consists
of $E_k$ edges. As before $A_k$ is zero-padded to ensure that it lies in $\mathbb{S}^n$. Consider the
following invariant convex set for $\epsilon > 0$:

$$\{A : A \in \mathcal{A}, \ \Theta_{A_k}(A) \leq 2E_k - \epsilon\}.$$

This constraint set requires that a graph not contain the subgraph given by the
adjacency matrix $A_k$, with edge weights close to one.

Degree distribution. Using the notation described previously, let $d(A)$
denote the sorted degree sequence ($d(A)_1 \geq \cdots \geq d(A)_n$) of a graph specified by
adjacency matrix $A$. We wish to consider the set of all graphs that have degree sequence
$d(A)$. This set is in general not convex unless $A$ represents a (weighted) regular graph,
i.e., $d(A) = \alpha \mathbf{1}$ for some constant $\alpha$. Therefore we consider the convex hull of all graphs
that have degree sequence given by $d$:

$$\mathcal{D}(A) = \mathrm{conv}\{B : B \in \mathbb{S}^n, \ d(B) = d(A)\}.$$

This set is in fact tractable to represent, and is given by the set of graphs whose degree
sequence is majorized by $d$:

$$\Big\{B : B \in \mathbb{S}^n, \ \mathbf{1}^T B \mathbf{1} = \mathbf{1}^T d(A), \ \sum_{i=1}^{k} d(B)_i \leq \sum_{i=1}^{k} d(A)_i \ \ \forall k = 1, \dots, n-1\Big\}.$$

By the majorization principle [11] another representation for this convex set is as the
set of graphs whose degree sequence lies in the permutahedron generated by $d$ [149];
the permutahedron generated by a vector is the convex hull of all permutations of the
vector. The notion of majorization is sometimes also referred to as Lorenz dominance
(see the Appendix for more details).
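The majorization conditions above amount to an equality of totals together with a comparison of partial sums of the sorted sequences, which can be checked directly. The sketch below is illustrative: the function names, the tolerance, and the small graphs are assumptions and do not come from the text.

```python
def sorted_degrees(B):
    """Row sums of B sorted in descending order."""
    return sorted((sum(row) for row in B), reverse=True)

def degree_majorized(B, d, tol=1e-9):
    """Check whether the sorted degree sequence of B is majorized by the vector d."""
    dB = sorted_degrees(B)
    d = sorted(d, reverse=True)
    if abs(sum(dB) - sum(d)) > tol:          # totals must agree: 1^T B 1 = 1^T d
        return False
    run_B = run_d = 0.0
    for k in range(len(d) - 1):              # partial sums for k = 1, ..., n - 1
        run_B += dB[k]
        run_d += d[k]
        if run_B > run_d + tol:
            return False
    return True

# Two relabelings of the three-node path (hypothetical example data).
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
mix = [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(path_a, path_b)]

d_path = sorted_degrees(path_a)                       # [2, 1, 1]
mix_ok = degree_majorized(mix, d_path)                # a convex combination stays in the set
regular_ok = degree_majorized(path_a, [4 / 3] * 3)    # the path is not majorized by a flat vector
```

The averaged matrix `mix` has sorted degrees $(1.5, 1.5, 1)$, which are majorized by $(2, 1, 1)$, whereas the reverse comparison against a constant degree vector fails at the first partial sum.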

Spectral distribution. Let $\lambda(A)$ denote the spectrum of a graph represented by
adjacency matrix $A$. As before we are interested in the set of all graphs that have
spectrum $\lambda(A)$. This set is nonconvex in general, unless $A$ is a multiple of the identity
matrix in which case all the eigenvalues are the same. Therefore we consider the convex
hull of all graphs (i.e., symmetric adjacency matrices) that have spectrum equal to
$\lambda(A)$:

$$\mathcal{E}(A) = \mathrm{conv}\{B : B \in \mathbb{S}^n, \ \lambda(B) = \lambda(A)\}.$$

This convex hull also has a tractable semidefinite representation analogous to the
description above [11]:

$$\Big\{B : B \in \mathbb{S}^n, \ \mathrm{Tr}(B) = \mathrm{Tr}(A), \ \sum_{i=1}^{k} \lambda(B)_i \leq \sum_{i=1}^{k} \lambda(A)_i \ \ \forall k = 1, \dots, n-1\Big\}.$$

Note that eigenvalues are specified in descending order, so that $\sum_{i=1}^{k} \lambda(B)_i$ represents
the sum of the $k$ largest eigenvalues of $B$.

■  6.3.5  Representation  of  convex  graph  invariants 

All invariant convex sets and convex graph invariants can be represented using
elementary convex graph invariants. In this section we describe both these representation
results. Representation theorems in mathematics give expressions of complicated sets
or functions in terms of simpler, basic objects. In functional analysis the Riesz
representation theorem relates elements in a Hilbert space and its dual, by uniquely associating
each element of the Hilbert space to a linear functional [128]. In probability theory
de Finetti’s theorem states that a collection of exchangeable random variables can be
expressed as a mixture of independent, identically distributed random variables. In
convex analysis every closed convex set can be expressed as the intersection of
halfspaces [124]. In each of these cases representation theorems provide a powerful analysis
tool as they give a canonical expression for complicated mathematical objects in terms
of elementary sets/functions.

First we give a representation result for convex graph invariants. In order to get
a flavor of this result consider the maximum absolute-value node weight invariant of
Section 6.3.3, which is represented as the supremum over two elementary convex graph
invariants. The following theorem states that in fact any convex graph invariant can
be expressed as a supremum over elementary invariants:

Theorem 6.3.1. Let $f$ be any convex graph invariant. Then $f$ can be expressed as
follows:

$$f(A) = \sup_{P \in \mathcal{P}} \ \Theta_P(A) - \alpha_P,$$

for $\alpha_P \in \mathbb{R}$ and for some subset $\mathcal{P} \subset \mathbb{S}^n$.

Proof. Since $f$ is a convex function, it can be expressed as the supremum over linear
functionals as follows:

$$f(A) = \sup_{P \in \mathcal{P} \subset \mathbb{S}^n} \ \mathrm{Tr}(PA) - \alpha_P,$$

for $\alpha_P \in \mathbb{R}$. This conclusion follows directly from the separation theorem in convex
analysis [124]; in particular this description of the convex function $f$ can be viewed as a
specification in terms of supporting hyperplanes of the epigraph of $f$, which is a convex
subset of $\mathbb{S}^n \times \mathbb{R}$. However as $f$ is also a graph invariant, we have that $f(A) = f(\Pi A \Pi^T)$
for any permutation $\Pi$ and for all $A \in \mathbb{S}^n$. Consequently for any permutation $\Pi$ and
for any $P \in \mathcal{P}$,

$$f(A) = f(\Pi A \Pi^T) \geq \mathrm{Tr}(P \Pi A \Pi^T) - \alpha_P.$$

Thus we have that

$$f(A) \geq \sup_{P \in \mathcal{P}} \ \Theta_P(A) - \alpha_P. \qquad (6.5)$$

However it is also clear that for each $P \in \mathcal{P}$

$$\Theta_P(A) - \alpha_P \geq \mathrm{Tr}(PA) - \alpha_P,$$

which allows us to conclude that

$$\sup_{P \in \mathcal{P}} \ \Theta_P(A) - \alpha_P \ \geq \ \sup_{P \in \mathcal{P}} \ \mathrm{Tr}(PA) - \alpha_P = f(A). \qquad (6.6)$$

Combining equations (6.5) and (6.6) we have the desired result. □

Remark 6.3.2. This result can be strengthened in the sense that one need only consider
elements in $\mathcal{P}$ that lie in different equivalence classes up to conjugation by permutation
matrices $\Pi \in \mathrm{Sym}(n)$. In each equivalence class the representative functional is the one
with the smallest value of $\alpha_P$. This idea can be formalized as follows. Consider the
group action $\rho : (M, \Pi) \mapsto \Pi M \Pi^T$ that conjugates elements in $\mathbb{S}^n$ by a permutation
matrix in $\mathrm{Sym}(n)$. With this notation we may restrict our attention in Theorem 6.3.1
to $\mathcal{P} \subset \mathbb{S}^n / \mathrm{Sym}(n)$, where $\mathbb{S}^n / \mathrm{Sym}(n)$ represents the quotient space under the group
action $\rho$. Such a mathematical object obtained by taking the quotient of a Euclidean
space (or more generally a smooth manifold) under the action of a finite group is called
an orbifold. With this strengthening one can show that there exists a unique, minimal
representation set $\mathcal{P} \subset \mathbb{S}^n / \mathrm{Sym}(n)$. We however do not emphasize such refinements in
subsequent results, and stick with the weaker statement that $\mathcal{P} \subset \mathbb{S}^n$ for notational and
conceptual simplicity.

As  our  next  result  we  show  that  any  invariant  convex  set  can  be  represented  as  the 
intersection  of  sublevel  sets  of  elementary  convex  graph  invariants: 

Proposition 6.3.1. Let $S \subseteq \mathbb{S}^n$ be an invariant convex set. Then there exists a
representation of $S$ as follows:

$$S = \bigcap_{P \in \mathcal{P}} \{A : A \in \mathbb{S}^n, \ \Theta_P(A) \leq \alpha_P\},$$

for some $\mathcal{P} \subset \mathbb{S}^n$ and for $\alpha_P \in \mathbb{R}$.

Proof. The proof of this statement proceeds in an analogous manner to that of
Theorem 6.3.1, and is again essentially a consequence of the separation theorem in convex
analysis. □

■  6.3.6  A  Robust  Optimization  view  of  invariant  convex  sets 

Uncertainty arises in many real-world problems. An important goal in robust
optimization (see [10] and the references therein) is to translate formal notions of measures
of uncertainty into convex constraint sets. Convexity is important in order to obtain
optimization formulations that are tractable.

The representation of a graph via an adjacency matrix in $\mathbb{S}^n$ is inherently uncertain
as we have no information about the specific labeling of the nodes of the graph. In this
section we associate to each graph a convex polytope, which represents the best convex
uncertainty set given a graph:

Definition 6.3.4. Let $\mathcal{G}$ be a graph that is represented by an adjacency matrix $A \in \mathbb{S}^n$
(any choice of representation is suitable). The convex hull of the graph $\mathcal{G}$ is defined as
the following convex polytope:

$$\mathcal{C}(\mathcal{G}) = \mathrm{conv}\{\Pi A \Pi^T : \Pi \in \mathrm{Sym}(n)\}.$$

Recall that $\mathrm{Sym}(n)$ is the symmetric group of $n \times n$ permutation matrices. One can
check that the convex hull of a graph is an invariant convex set, and that its extreme
points are the matrices $\Pi A \Pi^T$ for all $\Pi \in \mathrm{Sym}(n)$. Note that this convex hull may in
general be intractable to characterize; if these polytopes were tractable to characterize
we would be able to solve the graph isomorphism problem in polynomial time.

The convex hull of a graph is the smallest convex set that contains all the adjacency
matrices that represent the graph. Therefore $\mathcal{C}(\mathcal{G})$ is in some sense the “best convex
characterization” of the graph $\mathcal{G}$. This notion is related to the concept of risk measures
studied in [6], and the construction of convex uncertainty sets based on these risk
measures studied in [14]. In particular we recall the following definition from [14]:

Definition 6.3.5. Let $\mathcal{Z} = \{Z_1, \dots, Z_k\}$ be any finite collection of elements with $Z_i \in \mathbb{S}^n$.
Let $q \in \mathbb{R}^k$ be a probability distribution, i.e., $\sum_i q_i = 1$ and $q_i \geq 0$, $\forall i$. Then the
$q$-permutohull is the polytope in $\mathbb{S}^n$ defined as follows:

$$B_q(\mathcal{Z}) = \mathrm{conv}\Big\{\sum_{i=1}^{k} (\Pi q)_i Z_i : \Pi \in \mathrm{Sym}(k)\Big\}.$$

Convex uncertainty sets given by permutohulls emphasize a data-driven view of
robust optimization as adopted in [14]. Specifically the only information available about
an uncertain set in many settings is a finite collection of data vectors $\mathcal{Z}$, and the
probability distribution $q$ expresses preferences over such an unordered data set. Therefore
given a data set and a probability distribution that quantifies uncertainty with respect
to elements of this data set, the $q$-permutohull is the smallest convex set expressing
these uncertainty preferences. We note that an important property of a permutohull is
that it is invariant with respect to relabeling of the data vectors in $\mathcal{Z}$.

The convex hull of a graph $\mathcal{C}(\mathcal{G})$ is a simple example of a permutohull $B_q(\mathcal{Z})$, with
the distribution being $q = (1, 0, \dots, 0)$ and the set $\mathcal{Z} = \{\Pi A \Pi^T : \Pi \in \mathrm{Sym}(n)\}$ where
$A \in \mathbb{S}^n$ represents the graph $\mathcal{G}$. More complicated permutohulls of graphs may be of
interest in several applications but we do not pursue these generalizations here, and
instead focus on the case of the convex hull of a graph as defined above.

The convex hull of a graph is itself an invariant convex set by definition. Therefore
we can appeal to Proposition 6.3.1 to give a representation of this set in terms of
sublevel sets of elementary convex graph invariants. As our next result we show that
the values of all elementary convex graph invariants of $\mathcal{G}$ can be used to produce such
a representation:

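Proposition 6.3.2. Let $\mathcal{G}$ be a graph and let $A \in \mathbb{S}^n$ be an adjacency matrix
representing $\mathcal{G}$. We then have that

$$\mathcal{C}(\mathcal{G}) = \bigcap_{P \in \mathbb{S}^n} \{B : B \in \mathbb{S}^n, \ \Theta_P(B) \leq \Theta_P(A)\}.$$

Proof. One direction of inclusion in this result is easily seen. Indeed we have that for
any $\Pi \in \mathrm{Sym}(n)$

$$\Pi A \Pi^T \in \bigcap_{P \in \mathbb{S}^n} \{B : B \in \mathbb{S}^n, \ \Theta_P(B) \leq \Theta_P(A)\}.$$

As the right-hand side is a convex set, it is clear that the convex hull $\mathcal{C}(\mathcal{G})$ is contained in
the set on the right-hand side:

$$\mathcal{C}(\mathcal{G}) \subseteq \bigcap_{P \in \mathbb{S}^n} \{B : B \in \mathbb{S}^n, \ \Theta_P(B) \leq \Theta_P(A)\}.$$

For the other direction suppose for the sake of a contradiction that we have a point
$M \notin \mathcal{C}(\mathcal{G})$ but with $\Theta_P(M) \leq \Theta_P(A)$ for all $P \in \mathbb{S}^n$. As $M \notin \mathcal{C}(\mathcal{G})$ we appeal to the
separation theorem from convex analysis [124] to produce a strict separating hyperplane
between $M$ and $\mathcal{C}(\mathcal{G})$, i.e., a $P \in \mathbb{S}^n$ such that

$$\mathrm{Tr}(PB) < \alpha, \ \forall B \in \mathcal{C}(\mathcal{G}), \quad \text{and} \quad \mathrm{Tr}(PM) > \alpha.$$

Further as $\mathcal{C}(\mathcal{G})$ is an invariant convex set, it must be the case that

$$\Theta_P(B) < \alpha, \ \forall B \in \mathcal{C}(\mathcal{G}).$$

On the other hand as $\mathrm{Tr}(PM) > \alpha$ we also have that $\Theta_P(M) > \alpha$. It is thus clear that

$$\Theta_P(A) < \alpha < \Theta_P(M),$$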
which leads us to a contradiction and concludes the proof. □

Therefore elementary convex graph invariants are useful for representing all the
“convex properties” of a graph. This result agrees with the intuition that the “maximum
amount of information” that one can hope to obtain from convex graph invariants about
a graph should be limited fundamentally by the convex hull of the graph.
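Both directions of Proposition 6.3.2 can be spot-checked on four-node graphs. The sketch below is illustrative only: the helper names, the random seed, the number of random test matrices, and the choice of a path and a star are assumptions. It forms a random convex combination of all relabelings of the path, checks the inequality $\Theta_P(B) \leq \Theta_P(A)$ for a handful of random $P$, and then exhibits one $P$ that separates the star from the convex hull of the path.

```python
import random
from itertools import permutations

def theta(P, A):
    """Brute-force Theta_P(A) = max over permutations of sum_ij P[i][j] * A[s(i)][s(j)]."""
    n = len(A)
    return max(
        sum(P[i][j] * A[perm[i]][perm[j]] for i in range(n) for j in range(n))
        for perm in permutations(range(n))
    )

def relabel(A, perm):
    """Return Pi A Pi^T entrywise."""
    n = len(A)
    return [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

def random_symmetric(n):
    """Random symmetric matrix with standard normal entries."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            P[i][j] = P[j][i] = random.gauss(0, 1)
    return P

random.seed(0)
n = 4
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

# A point B in the convex hull of the path: random convex combination of its orbit.
orbit = [relabel(path, p) for p in permutations(range(n))]
w = [random.random() for _ in orbit]
total = sum(w)
B = [[sum(wk * M[i][j] for wk, M in zip(w, orbit)) / total for j in range(n)]
     for i in range(n)]

# Points of the hull satisfy Theta_P(B) <= Theta_P(A) for every P (spot check).
inside_ok = all(theta(P, B) <= theta(P, path) + 1e-9
                for P in (random_symmetric(n) for _ in range(20)))

# The star has the same number of edges but lies outside the hull of the path:
# taking P equal to the star itself gives a separating elementary invariant.
separated = theta(star, star) > theta(star, path)
```

The separation works because the star contains a node of degree three, whereas any relabeling of the path overlaps the star in at most two edges.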

As  mentioned  previously  in  many  cases  the  convex  hull  of  a  graph  may  be  intractable 
to  characterize.  One  can  obtain  outer  bounds  to  this  convex  hull  by  using  a  tractable 
subset  of  elementary  convex  graph  invariants;  therefore  we  may  obtain  tractable  but 
weaker  convex  uncertainty  sets  than  the  convex  hull  of  a  graph.  From  Proposition  6.3.2 
such approximations can be refined as we use additional elementary convex graph
invariants. As an example the spectral convex constraint sets described in Section 6.3.4
provide a tractable relaxation that plays a prominent role in our experiments in
Section 6.4.

■  6.3.7  Comparison  with  spectral  invariants 

Convex functions that are invariant under certain group actions have been studied
previously. The most prominent among these is the set of convex functions of symmetric
matrices that are invariant under conjugation by orthogonal matrices [42]:

$$f(M) = f(V M V^T), \quad \forall M \in \mathbb{S}^n, \ \forall V \in O(n).$$

It is clear that such functions depend only on the spectrum of a symmetric matrix, and
therefore we refer to them as convex spectral invariants:

$$f(M) = \tilde{f}(\lambda(M)),$$

where $\tilde{f} : \mathbb{R}^n \to \mathbb{R}$. It is shown in [42] that $f$ is convex if and only if $\tilde{f}$ is a convex
function that is symmetric in its argument:

$$\tilde{f}(x) = \tilde{f}(\Pi x), \quad \forall x \in \mathbb{R}^n, \ \forall \Pi \in \mathrm{Sym}(n).$$

One can check that any convex spectral invariant can be represented as the supremum
over monotone functionals of the spectrum of the form:

$$\tilde{f}(x) = v^T x - \alpha,$$

for $v \in \mathbb{R}^n$ such that $v_1 \geq \cdots \geq v_n$. See the Appendix for more details.

A convex spectral invariant is also a convex graph invariant as invariance with
respect to conjugation by any orthogonal matrix is a stronger requirement than invariance
with respect to conjugation by any permutation matrix. As many convex spectral
invariants are tractable to compute, they form an important subclass of convex graph
invariants. In Section 6.4.1 we discuss a natural approximation to elementary convex
graph invariants using convex spectral invariants by replacing the symmetric group
$\mathrm{Sym}(n)$ in the maximization by the orthogonal group $O(n)$. Finally one can define a
spectrally invariant convex set $S$ (analogous to invariant convex sets defined in
Section 6.3.2) in which $M \in S$ if and only if $V M V^T \in S$ for all $V \in O(n)$. Such sets are
very useful in order to impose various spectral constraints on graphs, and often have
tractable semidefinite representations.
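The effect of replacing $\mathrm{Sym}(n)$ by $O(n)$ in the maximization can be previewed numerically. The sketch below relies on a standard fact that is not stated in this section and is used here as an assumption: the maximum of $\mathrm{Tr}(P V A V^T)$ over orthogonal $V$ equals the inner product of the similarly sorted spectra of $P$ and $A$. Because every permutation matrix is orthogonal, this quantity upper-bounds $\Theta_P(A)$. The function names, seed and random matrices are illustrative, and NumPy is assumed to be available.

```python
import numpy as np
from itertools import permutations

def theta(P, A):
    """Brute-force Theta_P(A) = max over permutations of Tr(P Pi A Pi^T)."""
    n = A.shape[0]
    return max(float(np.sum(P * A[np.ix_(p, p)]))
               for p in map(list, permutations(range(n))))

def spectral_bound(P, A):
    """lambda(P)^T lambda(A), with both spectra sorted in the same (ascending) order."""
    return float(np.dot(np.linalg.eigvalsh(P), np.linalg.eigvalsh(A)))

rng = np.random.default_rng(1)

def random_symmetric(n):
    M = rng.standard_normal((n, n))
    return (M + M.T) / 2

n = 4
A = random_symmetric(n)
P = random_symmetric(n)

exact = theta(P, A)             # maximum over the n! permutation matrices
relaxed = spectral_bound(P, A)  # maximum over the whole orthogonal group

# With P = A the identity permutation already attains the spectral bound,
# since Tr(A A) equals the sum of squared eigenvalues of A.
tight_exact = theta(A, A)
tight_relaxed = spectral_bound(A, A)
```

The gap between `exact` and `relaxed` gives a rough sense of how much is lost by the spectral approximation on a particular instance.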

■  6.3.8  Convex  versus  non-convex  invariants 

There  are  many  graph  invariants  that  are  not  convex.  In  this  section  we  give  two  ex¬ 
amples  that  serve  to  illustrate  the  strengths  and  weaknesses  of  convex  graph  invariants. 
First consider the spectral invariant given by the fifth largest eigenvalue of a graph, i.e.,
$\lambda_5(A)$ for a graph specified by adjacency matrix $A$. This function is a graph invariant
but it is not convex. However from Section 6.3.3 we have that the sum of the first five
eigenvalues of a graph is a convex graph invariant. More generally any function of the
form $v_1 \lambda_1 + \cdots + v_5 \lambda_5$ with $v_1 \geq \cdots \geq v_5$ is a convex graph invariant. Thus information
about the fifth eigenvalue can be obtained in a “convex manner” only by including information
about all the top five eigenvalues (or all the bottom $n - 4$ eigenvalues). As a
second  example  consider  the  (weighted)  sum  of  the  total  number  of  triangles  that  occur 
as  subgraphs  in  a  graph.  This  function  is  again  a  non-convex  graph  invariant.  However 
recall  from  the  forbidden  subgraph  example  in  Section  6.3.4  that  we  can  use  elementary 
convex  graph  invariants  to  test  whether  a  graph  contains  a  triangle  as  a  subgraph  (with 
the  edges  of  the  triangle  having  large  weights).  Therefore,  roughly  speaking  convex 
graph  invariants  can  be  used  to  decide  whether  a  graph  contains  a  triangle,  while  gen¬ 
eral  non-convex  graph  invariants  can  provide  more  information  about  the  total  number 
of  triangles  in  a  graph.  These  examples  demonstrate  that  convex  graph  invariants  have 
certain  limitations  in  terms  of  the  type  of  information  that  they  can  convey  about  a 
graph. 
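
The eigenvalue example can be checked by hand on a small instance. The diagonal matrices below are constructed here purely for illustration (they are an assumption of this sketch and do not come from the experiments in this chapter); they show the convexity inequality failing for $\lambda_5$ at a midpoint, while it holds for the sum of the top five eigenvalues:

```latex
A = \mathrm{diag}(2,2,2,2,0,0), \qquad
B = \mathrm{diag}(0,0,2,2,2,2), \qquad
\tfrac{1}{2}(A+B) = \mathrm{diag}(1,1,2,2,1,1).

% A single eigenvalue violates convexity at the midpoint:
\lambda_5(A) = \lambda_5(B) = 0, \qquad
\lambda_5\big(\tfrac{1}{2}(A+B)\big) = 1 \;>\; \tfrac{1}{2}\big(\lambda_5(A) + \lambda_5(B)\big) = 0.

% The sum of the top five eigenvalues satisfies the convexity inequality:
\sum_{i=1}^{5} \lambda_i(A) = \sum_{i=1}^{5} \lambda_i(B) = 8, \qquad
\sum_{i=1}^{5} \lambda_i\big(\tfrac{1}{2}(A+B)\big) = 2+2+1+1+1 = 7 \;\leq\; 8.
```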

The weaker form of information about a graph conveyed by convex graph invariants
is nonetheless still useful in distinguishing between graphs. As the next result
demonstrates, convex graph invariants are strong enough to distinguish between
non-isomorphic graphs. This lemma follows from a straightforward application of
Proposition 6.3.2:

Lemma 6.3.1. Let $\mathcal{G}_1, \mathcal{G}_2$ be two non-isomorphic graphs represented by adjacency matrices
$A_1, A_2 \in \mathbb{S}^n$, i.e., there exists no permutation $\Pi \in \mathrm{Sym}(n)$ such that
$A_1 = \Pi A_2 \Pi^T$. Then there exists a $P \in \mathbb{S}^n$ such that $\Theta_P(A_1) \neq \Theta_P(A_2)$.

Proof. Assume for the sake of a contradiction that $\Theta_P(A_1) = \Theta_P(A_2)$ for all $P \in \mathbb{S}^n$.
Then we have from Proposition 6.3.2 that $\mathcal{C}(\mathcal{G}_1) = \mathcal{C}(\mathcal{G}_2)$. As the extreme points of
these polytopes must be the same, there must exist a permutation $\Pi \in \mathrm{Sym}(n)$ such
that $A_1 = \Pi A_2 \Pi^T$. This leads to a contradiction. □

Hence  for  any  two  given  non-isomorphic  graphs  there  exists  an  elementary  convex 
graph  invariant  that  evaluates  to  different  values  for  these  two  graphs.  Consequently 
elementary  convex  graph  invariants  form  a  complete  set  of  graph  invariants  as  they  can 
distinguish  between  any  two  non-isomorphic  graphs. 

■  6.4  Computing  Convex  Graph  Invariants 

In  this  section  we  focus  on  efficiently  computing  and  approximating  convex  graph  invari¬ 
ants,  and  on  tractable  representations  of  invariant  convex  sets.  We  begin  by  studying 
the  question  of  computing  elementary  convex  graph  invariants,  before  moving  on  to 
more  general  convex  invariants. 

■  6.4.1  Elementary  invariants  and  the  Quadratic  Assignment  problem 

As all convex graph invariants can be represented using only elementary invariants, we
initially focus on computing the latter. Computing an elementary convex graph invariant
$\Theta_P(A)$ for general $A, P$ is equivalent to solving the so-called Quadratic Assignment
Problem (QAP) [32]. Solving QAP is hard in general, because it includes as a special
case the Hamiltonian cycle problem; if $P$ is the adjacency matrix of the $n$-cycle, then for
an unweighted graph specified by adjacency matrix $A$ we have that $\Theta_P(A)$ is equal to
$2n$ if and only if the graph contains a Hamiltonian cycle. However there are well-studied
spectral and semidefinite relaxations for QAP, which we discuss next.
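
To make the connection to the Hamiltonian cycle problem concrete, the following is a minimal brute-force sketch that evaluates $\Theta_P(A) = \max_{\Pi} \mathrm{Tr}(P \Pi A \Pi^T)$ by enumerating all $n!$ permutations. The function names (`theta`, `cycle_adjacency`, `adjacency`) are hypothetical helpers introduced only for this illustration, and the approach is practical only for very small $n$.

```python
from itertools import permutations


def theta(P, A):
    """Brute-force Theta_P(A): maximum of Tr(P Pi A Pi^T) over all permutations."""
    n = len(A)
    best = float("-inf")
    for perm in permutations(range(n)):
        # Relabeling the nodes of A by perm gives (Pi A Pi^T)[i][j] = A[perm[i]][perm[j]].
        # For symmetric P the trace equals the entrywise inner product below.
        val = sum(P[i][j] * A[perm[i]][perm[j]] for i in range(n) for j in range(n))
        best = max(best, val)
    return best


def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an unweighted graph on n nodes."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def cycle_adjacency(n):
    """Adjacency matrix of the n-cycle."""
    return adjacency(n, [(i, (i + 1) % n) for i in range(n)])


n = 5
P = cycle_adjacency(n)
# A 5-cycle with one chord contains a Hamiltonian cycle, so Theta_P(A) = 2n.
with_cycle = adjacency(n, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)])
# A path on 5 nodes has no Hamiltonian cycle, so Theta_P(A) < 2n.
path = adjacency(n, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(theta(P, with_cycle) == 2 * n, theta(P, path) < 2 * n)
```

Each cycle edge that lands on an edge of the graph contributes 2 to the trace, which is why the value $2n$ certifies that all $n$ cycle edges were matched.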

The spectral relaxation of $\Theta_P(A)$ is obtained by replacing the symmetric group
Sym(n) in the definition by the orthogonal group O(n):

$$\Lambda_P(A) = \max_{V \in O(n)} \mathrm{Tr}(P V A V^T). \tag{6.7}$$

Clearly $\Theta_P(A) \leq \Lambda_P(A)$ for all $A, P \in \mathbb{S}^n$. As one might expect $\Lambda_P(A)$ has a simple
closed-form solution [67]:

$$\Lambda_P(A) = \lambda(P)^T \lambda(A), \tag{6.8}$$

where $\lambda(A), \lambda(P)$ are the eigenvalues of $A, P$ sorted in descending order.
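
As a small sanity check of the closed form, consider an instance chosen here only for illustration (it is not taken from the original text): $A$ is the adjacency matrix of the path on three nodes and $P$ is the adjacency matrix of the triangle. The spectral bound can then be compared with the exact value by hand:

```latex
\lambda(P) = (2, -1, -1), \qquad \lambda(A) = (\sqrt{2},\, 0,\, -\sqrt{2}),

\Lambda_P(A) = \lambda(P)^T \lambda(A) = 2\sqrt{2} + 0 + \sqrt{2} = 3\sqrt{2} \approx 4.243.

% Since P = J - I (all ones off the diagonal), Tr(P \Pi A \Pi^T) is the sum of the
% off-diagonal entries of A for every permutation \Pi, so
\Theta_P(A) = 4 \;\leq\; \Lambda_P(A) = 3\sqrt{2}.
```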

The  spectral  relaxation  offers  a  simple  bound,  but  is  quite  weak  in  many  instances. 
Next  we  consider  the  well-studied  semidefinite  relaxation  for  the  QAP,  which  offers  a 
tighter  relaxation  [147].  The  main  idea  behind  the  semidefinite  relaxation  is  that  we 
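can linearize $\Theta_P(A)$ as follows:

$$
\begin{aligned}
\Theta_P(A) &= \max_{\Pi \in \mathrm{Sym}(n)} \mathrm{Tr}(P \Pi A \Pi^T) \\
&= \max_{x \in \mathbb{R}^{n^2},\ x = \mathrm{vec}(\Pi),\ \Pi \in \mathrm{Sym}(n)} \langle x, (A \otimes P) x \rangle \\
&= \max_{x \in \mathbb{R}^{n^2},\ x = \mathrm{vec}(\Pi),\ \Pi \in \mathrm{Sym}(n)} \mathrm{Tr}((A \otimes P) x x^T).
\end{aligned}
$$

Here $A \otimes P$ denotes the tensor product between $A$ and $P$, and vec denotes the operation
that stacks the columns of a matrix into a single vector. Consequently it is of interest
to characterize the following convex hull:

$$\mathrm{conv}\{x x^T : x \in \mathbb{R}^{n^2},\ x = \mathrm{vec}(\Pi),\ \Pi \in \mathrm{Sym}(n)\}.$$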

There is no known tractable characterization of this set, and by considering tractable
approximations the semidefinite relaxation to $\Theta_P(A)$ is then obtained as follows:


We refer the reader to [147] for the detailed steps involved in the construction of this
relaxation. This SDP relaxation, whose optimal value we denote by $\Omega_P(A)$, gives an upper
bound to $\Theta_P(A)$, i.e., $\Omega_P(A) \geq \Theta_P(A)$.
One can show that if the extra rank constraint


= 1
is added to the SDP (6.9), then $\Omega_P(A) = \Theta_P(A)$. Therefore if the optimal value of the
SDP (6.9) is achieved at some $y, Y$ such that this rank-one constraint is satisfied, then
the relaxation is tight, i.e., we would have that $\Omega_P(A) = \Theta_P(A)$.
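
The linearization underlying the semidefinite relaxation, $\mathrm{Tr}(P \Pi A \Pi^T) = \langle x, (A \otimes P) x \rangle$ with $x = \mathrm{vec}(\Pi)$, can be verified numerically on a small instance. The sketch below uses plain nested lists; the helper names and the matrices `A_demo`, `P_demo` are assumptions made for illustration and are not part of the original text.

```python
from itertools import permutations


def perm_matrix(perm):
    """Permutation matrix Pi with Pi[i][perm[i]] = 1."""
    n = len(perm)
    return [[1 if perm[i] == j else 0 for j in range(n)] for i in range(n)]


def vec(M):
    """Stack the columns of M into a single list (column-major order)."""
    n = len(M)
    return [M[r][c] for c in range(n) for r in range(n)]


def kron(A, P):
    """Kronecker (tensor) product of A and P as a nested list."""
    n, m = len(A), len(P)
    return [[A[i][j] * P[k][l] for j in range(n) for l in range(m)]
            for i in range(n) for k in range(m)]


def trace_form(P, Pi, A):
    """Tr(P Pi A Pi^T), computed directly from the definition."""
    n = len(A)
    M = [[sum(Pi[i][a] * A[a][b] * Pi[j][b] for a in range(n) for b in range(n))
          for j in range(n)] for i in range(n)]
    return sum(P[i][j] * M[j][i] for i in range(n) for j in range(n))


def quadratic_form(K, x):
    """x^T K x."""
    return sum(x[r] * K[r][c] * x[c] for r in range(len(x)) for c in range(len(x)))


# Small symmetric integer matrices, chosen arbitrarily for the check.
A_demo = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
P_demo = [[0, 1, 0], [1, 0, 4], [0, 4, 0]]
K_demo = kron(A_demo, P_demo)

identity_holds = all(
    trace_form(P_demo, perm_matrix(p), A_demo) == quadratic_form(K_demo, vec(perm_matrix(p)))
    for p in permutations(range(3))
)
print(identity_holds)
```

The check relies on $A$ and $P$ being symmetric and on vec using column-major stacking, matching the convention stated above.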

While the semidefinite relaxation (6.9) can in principle be computed in polynomial
time, the size of the variable $Y \in \mathbb{S}^{n^2}$ means that even moderate size problem instances
are not well-suited to solution by interior-point methods. In many practical
situations however, we often have that the matrix $P \in \mathbb{S}^n$ represents the adjacency matrix
of some small graph on $k$ nodes with $k \ll n$, i.e., $P$ is nonzero only inside a $k \times k$
submatrix and is zero-padded elsewhere so that it lies in $\mathbb{S}^n$. For example as discussed
in Section 6.3.4, $P$ may represent the adjacency matrix of a triangle in a constraint
expressing that a graph is triangle-free. In such cases computing or approximating $\Theta_P(A)$
may be done more efficiently as follows:

1. Combinatorial enumeration. For very small values of $k$ it is possible to compute
$\Theta_P(A)$ efficiently even by explicit combinatorial enumeration. The complexity
of such a procedure scales as $O(n^k)$. This approach may be suitable if, for
example, $P$ represents the adjacency matrix of a triangle.

2. Symmetry reduction. For larger values of $k$, combinatorial enumeration may
no longer be appropriate. In these cases the special structure in $P$ can be exploited
to reduce the size of the SDP relaxation (6.9). Specifically, using the methods
described in [43] it is possible to reduce the size of the matrix variables from
$O(n^2) \times O(n^2)$ to size $O(kn) \times O(kn)$. More generally, it is also possible to exploit
group symmetry in $P$ to similarly reduce the size of the SDP (6.9) (see [43] for
details).
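
The combinatorial enumeration option can be sketched as follows. When $P$ is a $k \times k$ pattern zero-padded to size $n \times n$, only the images of the $k$ pattern nodes affect $\mathrm{Tr}(P \Pi A \Pi^T)$, so it suffices to enumerate ordered $k$-tuples of distinct nodes. The function names here are hypothetical and the code is an illustrative sketch rather than the implementation used in the experiments.

```python
from itertools import permutations


def theta_small_pattern(P_small, A):
    """Theta_P(A) for a k x k pattern P_small (implicitly zero-padded to n x n).

    Enumerates ordered k-tuples of distinct nodes of A, i.e. O(n^k) candidates.
    """
    k, n = len(P_small), len(A)
    best = float("-inf")
    for nodes in permutations(range(n), k):
        val = sum(P_small[i][j] * A[nodes[i]][nodes[j]]
                  for i in range(k) for j in range(k))
        best = max(best, val)
    return best


def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an unweighted graph on n nodes."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
five_cycle = adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
five_cycle_chord = adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)])

# For unweighted graphs the value is 6 when a triangle is present and at most 4 otherwise.
print(theta_small_pattern(triangle, five_cycle))        # triangle-free graph
print(theta_small_pattern(triangle, five_cycle_chord))  # contains the triangle 0-1-2
```

For the triangle pattern this amounts to checking all ordered triples of nodes, which is the $O(n^3)$ case of the $O(n^k)$ scaling mentioned above.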

■  6.4.2  Other  methods  and  computational  issues 

In  many  special  cases  in  which  computing  convex  graph  invariants  may  be  intractable, 
it  is  also  possible  to  use  other  types  of  tractable  semidefinite  relaxations.  As  described 
in  Section  6.3.3  the  MAXCUT  value  and  the  inverse  stability  number  of  graphs  are 
invariants  that  are  respectively  convex  and  concave.  However  both  of  these  are  in¬ 
tractable  to  compute,  and  as  a  result  we  must  employ  the  SDP  relaxations  for  these 
invariants  as  discussed  in  Section  6.3.3. 

Another issue that arises in practice is the representation of invariant convex sets.
As an example, let $f(A)$ denote the SDP relaxation of the MAXCUT value as defined
in (6.2). As $f(A)$ is a concave graph invariant, we may be interested in representing
convex constraint sets as follows:

$$\{A : A \in \mathbb{S}^n,\ f(A) \geq \alpha\} = \{A : A \in \mathbb{S}^n,\ \mathrm{Tr}(XA) \geq \alpha \ \ \forall X \in \mathbb{S}^n \text{ s.t. } X_{ii} = 1,\ X \succeq 0\}.$$

In order to computationally represent such a set specified in terms of a universal quantifier,
we appeal to convex duality. Using the standard dual formulation of (6.2), we
have that:

$$\{A : A \in \mathbb{S}^n,\ f(A) \geq \alpha\} = \{A : A \in \mathbb{S}^n,\ \exists\, Y \text{ diagonal s.t. } A \succeq Y,\ \mathrm{Tr}(Y) \geq \alpha\}.$$

This reformulation provides a description in terms of existential quantifiers that is more
suitable for practical representation. Such reformulations using convex duality are
well-known, and can be employed more generally (e.g., for invariant convex sets specified by
sublevel sets of the inverse stability number or its relaxations in Section 6.3.3).
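
The easy direction of this equivalence (weak duality) can be written out in one line. The derivation below is a sketch added for illustration; it uses only the constraints displayed above, namely $X \succeq 0$, $X_{ii} = 1$, $Y$ diagonal, and $A \succeq Y$:

```latex
% Y diagonal and X_{ii} = 1 give Tr(XY) = \sum_i X_{ii} Y_{ii} = Tr(Y), hence
\mathrm{Tr}(XA) - \mathrm{Tr}(Y) = \mathrm{Tr}(XA) - \mathrm{Tr}(XY) = \mathrm{Tr}\big(X (A - Y)\big) \;\geq\; 0,
% because X \succeq 0 and A - Y \succeq 0. Therefore
\exists\, Y \text{ diagonal},\ A \succeq Y,\ \mathrm{Tr}(Y) \geq \alpha
\;\Longrightarrow\;
\mathrm{Tr}(XA) \geq \alpha \quad \forall X \succeq 0 \text{ with } X_{ii} = 1.
```

The reverse implication is where strong duality is needed; it holds here because $X = I$ is strictly feasible for the minimization over $X$.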

■  6.5  Using  Convex  Graph  Invariants  in  Applications 

In  this  section  we  give  solutions  to  the  stylized  problems  of  Section  6.2  using  convex 
graph  invariants.  In  order  to  properly  state  our  results  we  begin  with  a  few  definitions. 
All  the  convex  programs  in  our  numerical  experiments  are  solved  using  a  combination 
of  the  SDPT3  package  [136]  and  the  YALMIP  parser  [98].  Finally  a  key  property  of 
normal cones that we use in stating our results is that for any convex set $\mathcal{C} \subset \mathbb{S}^n$, the
normal cones at all the extreme points of $\mathcal{C}$ form a partition¹ of $\mathbb{S}^n$ [124].

¹ Note that there may be overlap on the boundaries of the normal cones at the extreme points, but
these overlaps have smaller dimension than those of the normal cones.

■ 6.5.1 Application: Graph deconvolution

Given a combination of two graphs overlaid on the same set of nodes, the graph deconvolution
problem is to recover the individual graphs (as introduced in Section 6.2.1).

Problem 1. Let $\mathcal{G}_1$ and $\mathcal{G}_2$ be two graphs specified by particular adjacency matrices
$A_1^*, A_2^* \in \mathbb{S}^n$. We are given the sum $A = A_1^* + A_2^*$, and the additional information that
$A_1^*, A_2^*$ correspond to particular realizations (labelings of nodes) of $\mathcal{G}_1, \mathcal{G}_2$. The goal is
to recover $A_1^*$ and $A_2^*$ from $A$.

See Figure 6.1 for an example illustrating this problem. The key unknown in this
problem is the specific labeling of the nodes of $\mathcal{G}_1$ and $\mathcal{G}_2$ relative to each other in
the composite graph represented by $A$. As described in Section 6.3.6, the best convex
constraints that express this uncertainty are the convex hulls of the graphs $\mathcal{G}_1, \mathcal{G}_2$.
Therefore we consider the following natural solution based on convex optimization to
solve the deconvolution problem:


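Solution 1. Recall that $\mathcal{C}(\mathcal{G}_1)$ and $\mathcal{C}(\mathcal{G}_2)$ are the convex hulls of the unlabeled graphs
$\mathcal{G}_1, \mathcal{G}_2$ (which we are given), and that $\|\cdot\|_F$ denotes the Euclidean (Frobenius) norm.
We propose the following convex program to recover $A_1^*, A_2^*$:

$$
\begin{aligned}
(\hat{A}_1, \hat{A}_2) = \arg\min_{A_1, A_2 \in \mathbb{S}^n} \quad & \|A - A_1 - A_2\|_F \\
\text{s.t.} \quad & A_1 \in \mathcal{C}(\mathcal{G}_1),\ A_2 \in \mathcal{C}(\mathcal{G}_2).
\end{aligned}
\tag{6.10}
$$

One could also use in the objective any other norm that is invariant under conjugation
by permutation matrices. This program is convex, although it may not be tractable if
the sets $\mathcal{C}(\mathcal{G}_1), \mathcal{C}(\mathcal{G}_2)$ cannot be efficiently represented. Therefore it may be desirable to
use tractable convex relaxations $\mathcal{C}_1, \mathcal{C}_2$ of the sets $\mathcal{C}(\mathcal{G}_1), \mathcal{C}(\mathcal{G}_2)$, i.e., $\mathcal{C}(\mathcal{G}_1) \subseteq \mathcal{C}_1 \subset \mathbb{S}^n$
and $\mathcal{C}(\mathcal{G}_2) \subseteq \mathcal{C}_2 \subset \mathbb{S}^n$:

$$
\begin{aligned}
(\hat{A}_1, \hat{A}_2) = \arg\min_{A_1, A_2 \in \mathbb{S}^n} \quad & \|A - A_1 - A_2\|_F \\
\text{s.t.} \quad & A_1 \in \mathcal{C}_1,\ A_2 \in \mathcal{C}_2.
\end{aligned}
\tag{6.11}
$$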


Recall  from  Proposition  6.3.2  that  we  can  represent  C(Q)  using  all  the  elementary 
convex  graph  invariants.  Tractable  relaxations  to  this  convex  hull  may  be  obtained, 
for  example,  by  just  using  spectral  invariants,  degree-sequence  invariants,  or  any  other 
subset  of  invariant  convex  set  constraints  that  can  be  expressed  efficiently.  We  give 
numerical  examples  later  in  this  section.  The  following  result  gives  conditions  under 
which we can exactly recover $A_1^*, A_2^*$ using the convex program (6.11):

Proposition 6.5.1. Given the problem setup as described above, we have that $(\hat{A}_1, \hat{A}_2) =
(A_1^*, A_2^*)$ is the unique optimum of (6.11) if and only if:

$$T_{\mathcal{C}_1}(A_1^*) \cap -T_{\mathcal{C}_2}(A_2^*) = \{0\},$$

where $-T_{\mathcal{C}_2}(A_2^*)$ denotes the negative of the tangent cone $T_{\mathcal{C}_2}(A_2^*)$.

Proof. Note that in the setup described above $(A_1^*, A_2^*)$ is an optimal solution of the
convex program (6.11) as this point is feasible (since by construction $A_1^* \in \mathcal{C}(\mathcal{G}_1) \subseteq \mathcal{C}_1$
and $A_2^* \in \mathcal{C}(\mathcal{G}_2) \subseteq \mathcal{C}_2$), and the cost function achieves its minimum at this point. This
result is concerned with $(A_1^*, A_2^*)$ being the unique optimal solution.

For one direction suppose that $T_{\mathcal{C}_1}(A_1^*) \cap -T_{\mathcal{C}_2}(A_2^*) = \{0\}$. Then there exists no
$Z_1 \in T_{\mathcal{C}_1}(A_1^*)$, $Z_2 \in T_{\mathcal{C}_2}(A_2^*)$ such that $Z_1 + Z_2 = 0$ with $Z_1 \neq 0$, $Z_2 \neq 0$. Consequently
every feasible direction from $(A_1^*, A_2^*)$ into $\mathcal{C}_1 \times \mathcal{C}_2$ would increase the value of the
objective. Thus $(A_1^*, A_2^*)$ is the unique optimum of (6.11).

For the other direction suppose that $(A_1^*, A_2^*)$ is the unique optimum of (6.11), and
assume for the sake of a contradiction that $T_{\mathcal{C}_1}(A_1^*) \cap -T_{\mathcal{C}_2}(A_2^*)$ contains a nonzero
element, which we'll denote by $Z$. There exists a scalar $\alpha > 0$ such that $A_1^* + \alpha Z \in \mathcal{C}_1$
and $A_2^* - \alpha Z \in \mathcal{C}_2$. Consequently $(A_1^* + \alpha Z, A_2^* - \alpha Z)$ is also a feasible solution that
achieves the lowest possible cost of zero. This contradicts the assumption that $(A_1^*, A_2^*)$
is the unique optimum. □

Figure 6.3. (Panels: Cycle, Clebsch graph, Shrikhande graph.) The three graphs used in the
deconvolution experiments of Section 6.5.1. The Clebsch graph and the Shrikhande graph are
examples of strongly regular graphs on 16 nodes [70]; see Section 6.5.1 for more details about
the properties of such graphs.

Thus we have that transverse intersection of the tangent cones $T_{\mathcal{C}_1}(A_1^*)$ and $-T_{\mathcal{C}_2}(A_2^*)$
is equivalent to exact recovery of $(A_1^*, A_2^*)$ given the sum $A = A_1^* + A_2^*$. As $\mathcal{C}(\mathcal{G}_1) \subseteq \mathcal{C}_1$
and $\mathcal{C}(\mathcal{G}_2) \subseteq \mathcal{C}_2$, we have that $T_{\mathcal{C}(\mathcal{G}_1)}(A_1^*) \subseteq T_{\mathcal{C}_1}(A_1^*)$ and $T_{\mathcal{C}(\mathcal{G}_2)}(A_2^*) \subseteq T_{\mathcal{C}_2}(A_2^*)$. These
relations follow from the fact that the set of feasible directions from $A_1^*$ and $A_2^*$ into the
respective convex sets is enlarged. Therefore the tangent cone transversality condition
of Proposition 6.5.1 is generally more difficult to satisfy if we use relaxations $\mathcal{C}_1, \mathcal{C}_2$ to
the convex hulls $\mathcal{C}(\mathcal{G}_1), \mathcal{C}(\mathcal{G}_2)$. Consequently we have a tradeoff between the complexity
of solving the convex program, and the possibility of exactly recovering $(A_1^*, A_2^*)$. However
the following example suggests that it is possible to obtain tractable relaxations
that still allow for perfect recovery.

Example. We consider the 16-cycle, the Shrikhande graph, and the Clebsch graph
(see Figure 6.3), and investigate the deconvolution problem for all three pairings of these
graphs. For illustration purposes suppose $A_1^*$ is an adjacency matrix of the unweighted
16-node cycle denoted $\mathcal{G}_1$, and that $A_2^*$ is an adjacency matrix of the 16-node Clebsch
graph denoted $\mathcal{G}_2$ (see Figure 6.1). These adjacency matrices are random instances
chosen from the set of all valid adjacency matrices that represent the graphs $\mathcal{G}_1, \mathcal{G}_2$.
Given the sum $A = A_1^* + A_2^*$, we construct convex constraint sets $\mathcal{C}_1, \mathcal{C}_2$ as follows:

$$\mathcal{C}_1 = \mathcal{A} \cap \mathcal{E}(A_1^*),$$

$$\mathcal{C}_2 = \mathcal{A} \cap \mathcal{E}(A_2^*).$$

Here $\mathcal{E}(A)$ represents the spectral constraints of Section 6.3.4. Therefore the graphs $\mathcal{G}_1$
and $\mathcal{G}_2$ are characterized purely by their spectral properties. By running the convex
program described above for 100 random choices of labelings of the vertices of the graphs
$\mathcal{G}_1, \mathcal{G}_2$, we obtained exact recovery of the adjacency matrices $(A_1^*, A_2^*)$ in all cases (see
Table 6.1). Thus we have exact decomposition based only on convex spectral constraints,
in which the only invariant information used to characterize the component graphs $\mathcal{G}_1, \mathcal{G}_2$
are the spectra of $\mathcal{G}_1, \mathcal{G}_2$. Similarly successful decomposition results using only spectral
invariants are also seen in the cycle/Shrikhande graph deconvolution problem, and
the Clebsch graph/Shrikhande graph deconvolution problem; Table 6.1 gives complete
results.

The inspiration for using the Clebsch graph and the Shrikhande graph as examples
for deconvolution is based on Proposition 6.5.1. Specifically, a graph for which the
tangent cone with respect to the corresponding spectral constraint set $\mathcal{E}(A)$ (defined
in Section 6.3.4) is small is well-suited to being deconvolved from other graphs using
spectral invariants. This is because the tangent cone being smaller implies that the
transversality condition of Proposition 6.5.1 is easier to satisfy. In order to obtain
small tangent cones with respect to spectral constraint sets, we seek graphs that have
many repeated eigenvalues. Strongly regular graphs, such as the Clebsch graph and the
Shrikhande graph, are prominent examples of graphs with repeated eigenvalues as they
have only three distinct eigenvalues. A strongly regular graph is an unweighted regular
graph (i.e., each node has the same degree) in which every pair of adjacent vertices have
the same number of common neighbors, and every pair of non-adjacent vertices have
the same number of common neighbors [70]. We explore in more detail the properties
of these and other graph classes in a separate report [35], where we characterize families
of graphs for which the transverse intersection condition of Proposition 6.5.1 provably
holds for constraint sets $\mathcal{C}_1, \mathcal{C}_2$ constructed using tractable graph invariants.

Underlying graphs                              # successes in 100 random trials
The 16-cycle and the Clebsch graph             100
The 16-cycle and the Shrikhande graph          96
The Clebsch graph and the Shrikhande graph     94

Table 6.1. A summary of the results of graph deconvolution via convex optimization: We generated 100
random instances of each deconvolution problem by randomizing over the labelings of the components.
The convex program uses only spectral invariants to characterize the convex hulls of the component
graphs, as described in Section 6.5.1.

■  6.5.2  Application:  Generating  graphs  with  desired  properties 

We  first  consider  the  problem  of  constructing  a  graph  with  certain  desired  structural 
properties. 

Problem 2. Suppose we are given structural constraints on a graph in terms of a
collection of (possibly nonconvex) graph invariants $\{h_j(A) = \alpha_j\}$. Can we recover a
graph that is consistent with these constraints? For example we may be given constraints
on the spectrum, the degree distribution, the girth, and the MAXCUT value. Can we
construct some graph $\mathcal{G}$ that is consistent with this knowledge?

This problem may be infeasible in that there may be no graph consistent with the given
information. We do not address this feasibility question here, and instead focus only
on the computational problem of generating graphs that satisfy the given constraints
assuming such graphs do exist. Next we propose a convex programming approach using
invariant convex sets to construct a graph $\mathcal{G}$, specified by an adjacency matrix $A$, which
satisfies the required constraints. Both the problem as well as the solution can be suitably
modified to include inequality constraints.

Solution 2. We combine information from all the invariants to construct an invariant
convex set $\mathcal{C}$. Given a constraint of the form $h_j(A) = \alpha_j$, we consider the following
convex set:

$$\mathcal{C}_j = \mathrm{conv}\{A : A \in \mathbb{S}^n,\ h_j(A) = \alpha_j\}.$$

This set is convex by construction, and is an invariant convex set if $h_j$ is a graph
invariant. If $h_j$ is a convex graph invariant this set is equal to the sublevel set
$\{A : A \in \mathbb{S}^n,\ h_j(A) \leq \alpha_j\}$. Given a collection of constraints $\{h_j(A) = \alpha_j\}$ we then form an
invariant convex constraint set as follows:

$$\mathcal{C} = \cap_j\, \mathcal{C}_j.$$

Therefore any invariant information that is amenable to approximation as a convex
constraint set can be incorporated in such a framework. For example constraints on
the degree distribution or the spectrum can be naturally relaxed to tractable convex
constraints, as described in Section 6.3.4. If the set $\mathcal{C}$ as defined above is intractable to
compute, one may further relax $\mathcal{C}$ to obtain efficient approximations. In many cases
of interest a subset of the boundary of $\mathcal{C}$ corresponds to points at which all the
constraints are active $\{A : h_j(A) = \alpha_j\}$. In order to recover one of these extreme points,
we maximize a random linear functional defined by $M \in \mathbb{S}^n$ (with the entries in the
upper-triangular part chosen to be independent and identically distributed zero-mean,
variance-one standard Gaussians) over the set $\mathcal{C}$:

$$
\begin{aligned}
\hat{A} = \arg\max_{A \in \mathbb{S}^n} \quad & \mathrm{Tr}(MA) \\
\text{s.t.} \quad & A \in \mathcal{C}.
\end{aligned}
\tag{6.12}
$$

This convex program is successful if $\hat{A}$ is indeed an extreme point at which all the
constraints $\{h_j(A) = \alpha_j\}$ are satisfied.

Clearly this approach is well-suited for constructing constrained graphs only if the
convex set $\mathcal{C}$ described in the solution scheme contains many extreme points at which
all the constraints are satisfied. The next result gives conditions under which the convex
program recovers an $\hat{A}$ that satisfies all the given constraints:

Proposition 6.5.2. Consider the problem and solution setup as defined above. Define
the set $\mathcal{N}$ as follows:

$$\mathcal{N} = \bigcup_{\{A \,:\, A \in \mathcal{C},\ h_j(A) = \alpha_j \ \forall j\}} N_{\mathcal{C}}(A).$$

If $M \in \mathcal{N}$ then the optimum $\hat{A}$ of the convex program (6.12) satisfies all the specified
constraints exactly. In particular if $M$ is chosen uniformly at random as described
above, then the probability of success is equal to the fraction of $\mathbb{S}^n$ covered by the union
of the normal cones $\mathcal{N}$.

Figure 6.4. An adjacency matrix of a sparse, well-connected graph example obtained using the
approach described in Section 6.5.2: The weights of this graph lie in the range [0, 1], the black points
represent edges with nonzero weight, and the white points denote absence of edges. The (weighted)
degree of each node is 8, the average number of nonzero (weighted) edges per node is 8.4, the
second-smallest eigenvalue of the Laplacian is 4, and the weighted diameter is 3.

Proof. The proof follows from standard results in convex analysis. In particular we
appeal to the fact that a linear functional defined by $M$ achieves its maximum at
$\hat{A} \in \mathcal{C}$ if and only if $M \in N_{\mathcal{C}}(\hat{A})$. □
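
The normal-cone fact used in the proof can be illustrated in the simplest setting, where the constraint set is assumed to be the convex hull $\mathcal{C}(\mathcal{G})$ of all relabelings of a single small graph. A linear functional attains its maximum over a polytope at a vertex, so for tiny $n$ the maximizer of $\mathrm{Tr}(MA)$ can be found by enumerating relabelings. This brute-force sketch stands in for the convex program (it is not the SDP-based procedure used in the experiments), and the helper names are hypothetical.

```python
import random
from itertools import permutations


def relabel(A, perm):
    """Adjacency matrix of the same graph with nodes relabeled by perm."""
    n = len(A)
    return [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


def random_symmetric(n, rng):
    """Symmetric matrix with i.i.d. standard Gaussian upper-triangular entries."""
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            M[i][j] = M[j][i] = rng.gauss(0.0, 1.0)
    return M


def maximize_over_hull(M, A):
    """Maximizer of Tr(M X) over the convex hull of all relabelings of A.

    The maximum of a linear functional over a polytope is attained at a
    vertex, so enumerating the relabelings (the candidate vertices) suffices.
    """
    n = len(A)

    def value(X):
        return sum(M[i][j] * X[i][j] for i in range(n) for j in range(n))

    return max((relabel(A, p) for p in permutations(range(n))), key=value)


rng = random.Random(0)
# Path graph on 5 nodes as the underlying unlabeled graph.
path = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
M = random_symmetric(5, rng)
A_hat = maximize_over_hull(M, path)
# The optimum is itself an adjacency matrix of the path graph (some relabeling).
print(sorted(sum(row) for row in A_hat))
```

Because every candidate is a relabeling of the same graph, the returned matrix always represents that graph, whichever random $M$ is drawn.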

As a corollary of this result we observe that if the invariant information provided
exactly characterizes the convex hull of a graph $\mathcal{G}$, then the set $\mathcal{C}$ above is the convex
hull $\mathcal{C}(\mathcal{G})$ of the graph $\mathcal{G}$. In such cases the convex program given by (6.12) produces an
adjacency matrix representing $\mathcal{G}$ with probability one. Next we provide the results of
a simple experiment that demonstrates the effectiveness of our approach in generating
sparse graphs with large spectral gap.

Example. In this example we aim to construct graphs on $n = 40$ nodes with
adjacency matrices in $\mathcal{A}$ that have degree $d = 8$, node weights equal to zero, and
the second-smallest eigenvalue of the Laplacian being larger than $\epsilon = 4$. The goal is to
produce relatively sparse graphs that satisfy these constraints. The specified constraints
can be used to construct a convex set as follows:

$$\mathcal{C} = \{A : A \in \mathcal{A},\ \tfrac{1}{8} A \mathbf{1} = \mathbf{1},\ \lambda_{n-1}(L_A) \geq 4,\ A_{ii} = 0 \ \forall i\}.$$

By maximizing 100 random linear functionals over this set we obtained graphs in all
100 cases with total degree equal to 8, and in 98 of the 100 cases with the second-smallest
eigenvalue of the Laplacian equal to 4 (it is greater than 4 in the remaining two cases).
Interestingly the average number of edges with nonzero weight incident on each node
is 8.8 over these 100 trials, thus providing very sparse graphs that are well-connected.
Figure 6.4 gives an example of a graph generated randomly using this procedure; the
average number of nonzero (weighted) edges per node of this graph is 8.4, and its
(weighted) diameter is 3. Therefore this approach empirically yields sparse graphs that
are well-connected (i.e., with a large spectral gap).

We  would  like  to  point  out  here  a  different  approach  to  constructing  well-connected 
graphs,  which  tries  to  add  edges  from  a  subset  of  candidate  edges  to  maximize  the 
second  eigenvalue  of  the  graph  Laplacian  [69].  An  interesting  question  is  to  understand 
the  structure  of  the  extreme  points  of  the  set  C  in  this  example  as  the  graph  size  and 
the degree $(n, d)$ grow large, with $\epsilon$ held constant. For example it may be useful to
compute  the  fraction  of  the  normal  cones  at  those  extreme  points  corresponding  to 
expander  graphs.  More  generally  it  is  of  interest  to  give  conditions  on  constraint  sets 
under  which  the  procedure  described  above  is  successful  in  providing  graphs  that  satisfy 
all  the  constraints  with  high  probability. 

■  6.5.3  Application:  Graph  hypothesis  testing 

Finally  we  give  a  solution  to  the  hypothesis  testing  problem  in  which  we  have  two 
families  of  graphs,  and  the  goal  is  to  decide  which  of  these  families  offers  a  “better 
explanation”  for  a  given  candidate  “sample”  graph. 

Problem 3. Let $\mathcal{F}_1$ and $\mathcal{F}_2$ denote two families of graphs characterized in terms of
invariants $\{h_j^1\}$ and $\{h_j^2\}$ respectively; for example, a family could be specified as some set
of graphs that have similar spectral distributions, similar degree sequences, and similar
girths. Given a graph $\mathcal{G}$, which of the two families $\mathcal{F}_1, \mathcal{F}_2$ of graphs is more similar to
$\mathcal{G}$?

We emphasize that the sets of invariants that characterize $\mathcal{F}_1, \mathcal{F}_2$ may in general be
very different. Note that this question is not completely well-posed, as there may be
different answers depending on one's notion of similarity. In order to address this point,
we need to develop a statistical theory for graphs. In such a setting one could phrase
this question formally as a statistical hypothesis testing problem with appropriate error
metrics. Our focus in the present chapter is on proposing a convex optimization solution
to the hypothesis testing problem based on convex graph invariants, and using a reasonable
notion of similarity.

Solution 3. Let $A \in \mathbb{S}^n$ be an adjacency matrix that represents the graph $\mathcal{G}$. We construct
invariant convex sets $\mathcal{C}_1$ and $\mathcal{C}_2$ based on the sets of invariants $\{h_j^1\}$, $\{h_j^2\}$ in an
analogous manner to the construction described in the solution to the graph construction
problem of Section 6.5.2. As before one could employ further tractable relaxations
of these sets if they are intractable to compute. Assuming that these convex constraint
sets that summarize the families $\mathcal{F}_1$ and $\mathcal{F}_2$ are compact, we declare that $\mathcal{F}_1$ is closer
to $\mathcal{G}$ than $\mathcal{F}_2$ if the following holds:

$$\max_{M \in \mathcal{C}_1} \mathrm{Tr}(AM) > \max_{M \in \mathcal{C}_2} \mathrm{Tr}(AM). \tag{6.13}$$

Naturally we declare the opposite result if the inequality is switched. Computing the two
sides in this test can be done via convex optimization, and this computation is tractable
if $\mathcal{C}_1, \mathcal{C}_2$ are tractable to characterize.

Our  choice  of  the  function  to  be  maximized  over  C \ ,  C 2  is  motivated  by  a  similar 
procedure  in  statistics  and  signal  processing,  which  goes  by  the  name  of  “matched 

filtering.”  Of  course  other  (convex  invariant)  cost  functions  can  also  be  optimized 

depending  on  one’s  notion  of  similarity.  We  point  out  two  advantages  of  this  approach 
to  hypothesis  testing.  First  the  two  families  of  graphs  can  be  specified  in  terms  of 
different  sets  of  invariants,  as  seen  in  these  examples.  Second  the  optimal  solutions 
of  the  convex  programs  in  (6.13)  in  fact  provide  approximations  to  the  graph  Q  by 
elements  in  the  families  J-\,J~2.  We  give  illustrations  of  these  points  in  our  examples, 
which  we  describe  next. 

Example.  Let  Acycie  denote  the  adjacency  matrix  of  a  16-node  unweighted  cycle. 
As  our  first  family  we  consider  the  set  of  cycles  on  16  nodes.  We  approximate  this  family 
by  the  set  of  graphs  that  are  triangle-free  (in  the  sense  described  in  Section  6.3.4),  have 
degree  equal  to  2,  and  have  the  same  spectrum  as  a  16-node  unweighted  cycle.  Therefore 
the  set  C\  is  defined  as  follows: 

Cl  =  {A  :  A  e  A,  An  =  0  Vi,  \Al  =  1,  0a-3(A)  <  4}  n  £(Acycle). 

As  our  second  family,  we  consider  sparse  well-connected  graphs  on  16  nodes  with  maxi¬ 
mum  weighted  degree  less  than  or  equal  to  2.5,  and  with  the  second-smallest  eigenvalue 
of  the  Laplacian  bounded  below  by  1.1: 


C2  =  {A:AeA,  Au  =  0  Vi,  (Al)*  <  2.5  Vi,  \n-i(LA)  >  1.1}. 

Applying the solution described above to a test graph given by a 16-node unweighted
path graph (i.e., an unweighted cycle with an edge removed, see Figure 6.2), we find
that the path graph is “closer” to the family of cycles approximated by the set $\mathcal{C}_1$
than it is to the family $\mathcal{F}_2$. This agrees with the intuition that a path graph is not
well-connected, and is only one edge away from being a cycle. We also point out that
the optimal solution to the convex program on the left-hand side of the test (6.13) is
in fact an unweighted 16-node cycle with the missing edge in the path graph added
as an extra edge. Next we consider a different test graph: a 16-node cycle with two
additional edges across diametrically opposite nodes, i.e., assuming we label the nodes
of the 16-node cycle in order, we add edges between nodes 1 and 9, and between nodes 5 and 13
(again see Figure 6.2). While this graph is only two edges away from being a cycle, the
edges connecting far away nodes dramatically increase the connectivity of the graph. In
this case we find using the convex programming hypothesis test (6.13) that the family
$\mathcal{F}_2$ is in fact closer than $\mathcal{F}_1$ to the sample graph. Interestingly, the optimal solution
to the convex program on the left-hand side of the test (6.13) is again an unweighted
16-node cycle, this time with the two additional edges removed.
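
As a rough numerical companion to this example, the following Python sketch builds the three 16-node graphs discussed above and compares their second-smallest Laplacian eigenvalues, the quantity constrained in the second family. It assumes NumPy is available; the helper names `cycle_adjacency` and `algebraic_connectivity` are ours and do not come from the thesis. The sketch does not solve the convex programs in (6.13); it only illustrates the intuition that the path graph is poorly connected while the two chords increase connectivity.

```python
import numpy as np

def cycle_adjacency(n):
    """Adjacency matrix of an unweighted n-node cycle."""
    A = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        A[i, j] = A[j, i] = 1.0
    return A

def algebraic_connectivity(A):
    """Second-smallest eigenvalue of the graph Laplacian L_A = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))[1]

n = 16
A_cycle = cycle_adjacency(n)

# Path graph: the cycle with one edge removed (here the edge between nodes 16 and 1).
A_path = A_cycle.copy()
A_path[n - 1, 0] = A_path[0, n - 1] = 0.0

# Cycle with chords between nodes 1 and 9 and between nodes 5 and 13 (1-based labels).
A_chords = A_cycle.copy()
for u, v in [(0, 8), (4, 12)]:
    A_chords[u, v] = A_chords[v, u] = 1.0

lam_path = algebraic_connectivity(A_path)
lam_cycle = algebraic_connectivity(A_cycle)
lam_chords = algebraic_connectivity(A_chords)
print(lam_path, lam_cycle, lam_chords)

# Matched-filter style score Tr(A M) between the path and the cycle:
# it equals twice the number of edges the two graphs share.
score = np.trace(A_path @ A_cycle)
```

The ordering of the three connectivity values is consistent with the qualitative discussion: removing an edge lowers the second-smallest Laplacian eigenvalue, and adding the two long-range chords raises it.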

In order to thoroughly address the graph hypothesis testing problem, we need to develop
a framework of statistical models over spaces of graphs. With a proper statistical
framework in place we can evaluate the probability of error achieved by a hypothesis-testing
algorithm with respect to a suitable error-metric, analogous to similar methods
developed in other classical decision-theoretic problems in statistics. We defer these
questions to a separate paper.

■  6.6  Discussion 

In  this  chapter  we  introduced  and  studied  convex  graph  invariants,  which  are  graph 
invariants  that  are  convex  functions  of  the  adjacency  matrix.  Convex  invariants  form 
a  rich  subset  of  the  set  of  all  graph  invariants,  and  they  are  useful  in  developing  a 
unified  computational  framework  based  on  convex  optimization  to  solve  a  number  of 
graph  problems.  In  particular  we  described  three  canonical  problems  involving  the 
structural  properties  of  graphs,  namely,  graph  construction  given  constraints,  graph 
deconvolution  of  a  composite  graph  into  individual  components,  and  graph  hypothesis 
testing in which the objective is to decide which of two given families of graphs offers a
better explanation for a new sample graph. We presented convex optimization solutions
to all of these problems, with convex graph invariants playing a prominent role. These
solutions provided attractive empirical performance, and the resulting convex programs
are tractable and can be solved using general-purpose off-the-shelf software for
moderate-size instances.

Chapter 7


Conclusion 


The  central  theme  of  this  thesis  is  to  provide  solutions  to  address  some  of  the  challenges 
that  arise  in  modeling  the  interactions  among  a  large  collection  of  variables.  Here  we 
describe  the  main  contributions,  and  discuss  some  future  research  directions. 

■  7.1  Summary  of  Contributions 

Rank-Sparsity  Uncertainty  Principles  and  Matrix  Decomposition 

In  Chapter  3  we  studied  the  question  of  decomposing  the  sum  of  a  sparse  matrix  and 
a  low-rank  matrix  into  the  individual  components.  Such  a  decomposition  problem 
arises  in  a  number  of  applications  in  system  identification,  computational  complexity, 
and  statistical  model  selection.  Indeed  sparse-plus-low-rank  matrix  decomposition  is 
central  to  Gaussian  latent-variable  graphical  model  selection  addressed  in  Chapter  4. 
We  proposed  a  tractable  convex  program  to  solve  the  decomposition  problem,  and  gave 
conditions  under  which  it  exactly  identifies  the  correct  components.  Fundamental  to  the 
analysis  in  Chapter  3  is  a  new  rank-sparsity  uncertainty  principle  relating  the  sparsity 
pattern  of  a  matrix  to  its  row  and  column  spaces. 

Latent  Variable  Graphical  Model  Selection  via  Convex  Optimization 

Latent  variable  model  selection  is  a  major  challenge  in  statistics,  and  is  also  a  problem 
of  fundamental  interest  because  the  discovery  of  hidden  causes  affecting  some  observed 
phenomena  is  important  in  many  scientific  endeavors.  Our  main  contribution  in  this 
area  is  a  new  convex  optimization  method  with  theoretical  consistency  guarantees  for 
graphical  model  selection  with  latent  variables.  Specifically  this  convex  program  builds 
upon  the  framework  in  Chapter  3,  and  our  analysis  gives  conditions  under  which  the 
program  consistently  estimates  model  structure  in  the  high-dimensional  scaling  regime. 

The Convex Geometry of Linear Inverse Problems

The  abstract  mathematical  formulations  underlying  many  problems  involving  graphs 
and  graphical  models  can  in  fact  be  viewed  as  instances  of  inverse  problems  in  which 
we  wish  to  learn/reconstruct  structured  graphs  and  simple  statistical  models  given 
inexact  or  incomplete  information.  Chapter  5  develops  tractable  convex  relaxations 
for  a  general  class  of  inverse  problems  in  which  the  objective  is  to  recover  certain 
“simple”  models  given  a  limited  number  of  linear  measurements.  In  situations  when 
the  underlying  models  have  algebraic  structure,  the  resulting  convex  programs  can  be 
solved  or  approximated  by  semidefinite  programming.  We  provide  sharp  estimates  of 
the  number  of  generic  measurements  required  for  exact  and  robust  recovery  in  a  variety 
of  settings.  These  estimates  are  based  on  computing  certain  Gaussian  statistics  related 
to  the  underlying  model  geometry. 

Convex  Graph  Invariants 

Finally  we  consider  questions  motivated  by  statistical  models  over  the  space  of  graphs, 
so  that  a  graph  itself  is  viewed  as  a  sample  drawn  from  a  probability  distribution  defined 
over  some  set  of  graphs.  Natural  questions  that  arise  in  standard  statistical  settings 
can  then  be  posed  in  a  deterministic  framework  in  this  graph  setting  as  well.  For 
example  we  consider  problems  such  as  graph  deconvolution,  graph  sampling,  and  graph 
hypothesis  testing.  In  order  to  develop  a  unified  computational  framework  to  solve 
these  problems,  we  introduce  convex  graph  invariants  in  Chapter  6.  We  also  discuss 
connections  to  other  concepts  such  as  majorization,  robust  optimization,  and  graph 
isomorphism. 

■ 7.2 Future Directions

Special-Purpose Computational Methods

Many of the convex programs proposed in this thesis can be solved in polynomial time
using general-purpose software for moderate-size problem instances. However, it is of
interest to apply some of the convex programs (e.g., latent-variable graphical model
selection in Chapter 4, or the computation of some subset of convex graph invariants
in Chapter 6) to large-scale problem instances. Therefore special-purpose algorithms
tailored to specific structured convex programs must be developed to scale to massive
problem sizes.

Non-Gaussian Latent-Variable Modeling

The  methods  and  analysis  in  Chapter  4  are  relevant  for  Gaussian  model  selection.  In 
many  applications  of  interest,  e.g.,  in  computational  biology,  the  random  variables  of 
interest  are  fundamentally  non-Gaussian.  Therefore  it  is  important  to  develop  a  similar 
convex  optimization  formulation  with  consistency  guarantees  for  latent-variable  models 
with  non-Gaussian  variables,  e.g.,  for  categorical  data. 

Computational  Approximations  and  Tradeoffs 

Some  of  the  convex  programs  proposed  in  Chapter  5  and  in  Chapter  6  cannot  be 
solved  in  polynomial-time,  and  therefore  we  proposed  in  those  chapters  further  convex 
relaxations  which  are  tractable  to  solve.  A  basic  question  of  interest  in  several  settings 
is  the  tradeoff  incurred  due  to  these  tractable  relaxations.  For  example  in  Chapter  5 
the  tradeoff  can  be  specified  in  terms  of  the  increased  number  of  linear  measurements 
required  for  guaranteed  recovery  via  convex  optimization. 

Non-Gaussian  Linear  Measurement  Models 

In Chapter 5 we analyze the recovery guarantees of convex relaxation methods in
extracting structured models given linear measurements specified by random Gaussian
functionals. While such an analysis is useful for general atomic sets, particular
applications often necessitate the study of structured measurement matrices, e.g., partial
Fourier measurements of sparse vectors or partial entrywise sampling of low-rank
matrices. It is of interest to develop a unified framework based on a notion of incoherence
that is general enough to encompass most interesting applications.

Conditions  for  Graph  Deconvolution  and  Graph  Generation 

A further challenge that we are presently working to address is to provide theoretical
guarantees on the performance of our convex programs described in Chapter 6. For
example, which families of graphs can be deconvolved via the tractable spectral relaxation?
Which classes of structured graphs can be generated efficiently via convex optimization?




Appendix  A 


Proofs  of  Chapter  3 


■ A.1 SDP Formulation


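The problem (3.3) can be recast as a semidefinite program (SDP). Using the variational
characterizations of the $\ell_1$ norm and the nuclear norm from Chapter 2, (3.3) can be
rewritten as

\[
\begin{aligned}
\min_{A,B,W_1,W_2,Z} \quad & \gamma\, \mathbf{1}_n^T Z \mathbf{1}_n + \tfrac{1}{2}\left(\mathrm{trace}(W_1) + \mathrm{trace}(W_2)\right) \\
\text{s.t.} \quad & \begin{pmatrix} W_1 & B \\ B' & W_2 \end{pmatrix} \succeq 0, \\
& -Z_{i,j} \le A_{i,j} \le Z_{i,j} \quad \forall (i,j), \\
& A + B = C.
\end{aligned}
\tag{A.1}
\]

Here, $\mathbf{1}_n \in \mathbb{R}^n$ refers to the vector that has 1 in every entry.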
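
The constraints in (A.1) encode the two norms through auxiliary variables. The following NumPy sketch is our own illustration and not part of the thesis; the matrix sizes, the value of `gamma`, and the random seed are arbitrary assumptions. It checks that the natural feasible choices $Z = |A|$, $W_1 = U \Sigma U'$ and $W_2 = V \Sigma V'$, built from the SVD $B = U \Sigma V'$, satisfy the constraints and give the objective value $\gamma \|A\|_1 + \|B\|_*$. It verifies feasibility and the objective value at this point only; it does not solve the SDP.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, gamma = 6, 2, 0.5

# A sparse matrix A and a rank-r matrix B (illustrative data).
A = np.zeros((n, n))
A[0, 3], A[2, 2], A[5, 1] = 1.5, -2.0, 0.7
B = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# Feasible auxiliary variables built from |A| and the SVD of B.
Z = np.abs(A)
U, s, Vt = np.linalg.svd(B)
W1 = U @ np.diag(s) @ U.T
W2 = Vt.T @ np.diag(s) @ Vt

# Block matrix from the semidefinite constraint in (A.1).
block = np.block([[W1, B], [B.T, W2]])
min_eig = np.linalg.eigvalsh(block).min()

ones = np.ones(n)
sdp_objective = gamma * ones @ Z @ ones + 0.5 * (np.trace(W1) + np.trace(W2))
direct_objective = gamma * np.abs(A).sum() + s.sum()
print(min_eig, sdp_objective, direct_objective)
```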


■ A.2 Proofs

Proof  of  Proposition  3.3.1 

We begin by establishing that

\[
\max_{N \in T(B^*),\ \|N\| \le 1} \|P_{\Omega(A^*)}(N)\| < 1 \;\Longrightarrow\; \Omega(A^*) \cap T(B^*) = \{0\}, \tag{A.2}
\]

where $P_{\Omega(A^*)}(N)$ denotes the projection of $N$ onto the space $\Omega(A^*)$. Assume for the sake
of a contradiction that this assertion is not true. Thus, there exists $N \neq 0$ such that
$N \in \Omega(A^*) \cap T(B^*)$. Scale $N$ appropriately such that $\|N\| = 1$. Thus $N \in T(B^*)$ with
$\|N\| = 1$, but we also have that $\|P_{\Omega(A^*)}(N)\| = \|N\| = 1$ as $N \in \Omega(A^*)$. This leads to
a contradiction.

Next, we show that

\[
\max_{N \in T(B^*),\ \|N\| \le 1} \|P_{\Omega(A^*)}(N)\| \le \mu(A^*)\,\xi(B^*),
\]

which would allow us to conclude the proof of this proposition. We have the following
sequence of inequalities:

\[
\begin{aligned}
\max_{N \in T(B^*),\ \|N\| \le 1} \|P_{\Omega(A^*)}(N)\|
&\le \max_{N \in T(B^*),\ \|N\| \le 1} \mu(A^*)\, \|P_{\Omega(A^*)}(N)\|_\infty \\
&\le \max_{N \in T(B^*),\ \|N\| \le 1} \mu(A^*)\, \|N\|_\infty \\
&\le \mu(A^*)\,\xi(B^*).
\end{aligned}
\]

Here the first inequality follows from the definition (3.2) of $\mu(A^*)$ as $P_{\Omega(A^*)}(N) \in \Omega(A^*)$,
the second inequality is due to the fact that $\|P_{\Omega(A^*)}(N)\|_\infty \le \|N\|_\infty$, and the final
inequality follows from the definition (3.1) of $\xi(B^*)$. □


Proof  of  Proposition  3.4.1 

We first show that $(A^*, B^*)$ is an optimum of (3.3), before moving on to showing
uniqueness. Based on subgradient optimality conditions applied at $(A^*, B^*)$, there
must exist a dual $Q$ such that

\[
Q \in \gamma\, \partial\|A^*\|_1 \quad \text{and} \quad Q \in \partial\|B^*\|_*.
\]

The second condition in this proposition guarantees the existence of a dual $Q$ that
satisfies both these conditions simultaneously (see (3.11) and (3.12)). Therefore, we
have that $(A^*, B^*)$ is an optimum. Next we show that under the conditions specified
in the lemma, $(A^*, B^*)$ is also a unique optimum. To avoid cluttered notation, in the
rest of this proof we let $\Omega = \Omega(A^*)$, $T = T(B^*)$, $\Omega^c(A^*) = \Omega^c$, and $T^\perp(B^*) = T^\perp$.

Suppose that there is another feasible solution $(A^* + N_A, B^* + N_B)$ that is also a
minimizer. We must have that $N_A + N_B = 0$ because $A^* + B^* = C = (A^* + N_A) + (B^* + N_B)$.
Applying the subgradient property at $(A^*, B^*)$, we have that for any subgradient
$(Q_A, Q_B)$ of the function $\gamma\|A\|_1 + \|B\|_*$ at $(A^*, B^*)$

\[
\gamma\|A^* + N_A\|_1 + \|B^* + N_B\|_* \ge \gamma\|A^*\|_1 + \|B^*\|_* + \langle Q_A, N_A \rangle + \langle Q_B, N_B \rangle. \tag{A.3}
\]

Since $(Q_A, Q_B)$ is a subgradient of the function $\gamma\|A\|_1 + \|B\|_*$ at $(A^*, B^*)$, we must
have from (3.11) and (3.12) that

• $Q_A = \gamma\,\mathrm{sign}(A^*) + P_{\Omega^c}(Q_A)$, with $\|P_{\Omega^c}(Q_A)\|_\infty \le \gamma$.

• $Q_B = UV' + P_{T^\perp}(Q_B)$, with $\|P_{T^\perp}(Q_B)\| \le 1$.

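Using these conditions we rewrite $\langle Q_A, N_A \rangle$ and $\langle Q_B, N_B \rangle$. Based on the existence of
the dual $Q$ as described in the lemma, we have that

\[
\begin{aligned}
\langle Q_A, N_A \rangle &= \langle \gamma\,\mathrm{sign}(A^*) + P_{\Omega^c}(Q_A),\, N_A \rangle \\
&= \langle Q - P_{\Omega^c}(Q) + P_{\Omega^c}(Q_A),\, N_A \rangle \\
&= \langle P_{\Omega^c}(Q_A) - P_{\Omega^c}(Q),\, N_A \rangle + \langle Q, N_A \rangle,
\end{aligned}
\tag{A.4}
\]

where we have used the fact that $Q = \gamma\,\mathrm{sign}(A^*) + P_{\Omega^c}(Q)$. Similarly, we have that

\[
\begin{aligned}
\langle Q_B, N_B \rangle &= \langle UV' + P_{T^\perp}(Q_B),\, N_B \rangle \\
&= \langle Q - P_{T^\perp}(Q) + P_{T^\perp}(Q_B),\, N_B \rangle \\
&= \langle P_{T^\perp}(Q_B) - P_{T^\perp}(Q),\, N_B \rangle + \langle Q, N_B \rangle,
\end{aligned}
\tag{A.5}
\]

where we have used the fact that $Q = UV' + P_{T^\perp}(Q)$. Putting (A.4) and (A.5) together,
we have that

\[
\begin{aligned}
\langle Q_A, N_A \rangle + \langle Q_B, N_B \rangle
&= \langle P_{\Omega^c}(Q_A) - P_{\Omega^c}(Q),\, N_A \rangle + \langle P_{T^\perp}(Q_B) - P_{T^\perp}(Q),\, N_B \rangle + \langle Q,\, N_A + N_B \rangle \\
&= \langle P_{\Omega^c}(Q_A) - P_{\Omega^c}(Q),\, N_A \rangle + \langle P_{T^\perp}(Q_B) - P_{T^\perp}(Q),\, N_B \rangle \\
&= \langle P_{\Omega^c}(Q_A) - P_{\Omega^c}(Q),\, P_{\Omega^c}(N_A) \rangle + \langle P_{T^\perp}(Q_B) - P_{T^\perp}(Q),\, P_{T^\perp}(N_B) \rangle.
\end{aligned}
\tag{A.6}
\]

In the second equality, we used the fact that $N_A + N_B = 0$.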

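Since $(Q_A, Q_B)$ is any subgradient of the function $\gamma\|A\|_1 + \|B\|_*$ at $(A^*, B^*)$, we
have some freedom in selecting $P_{\Omega^c}(Q_A)$ and $P_{T^\perp}(Q_B)$ as long as they still satisfy the
subgradient conditions $\|P_{\Omega^c}(Q_A)\|_\infty \le \gamma$ and $\|P_{T^\perp}(Q_B)\| \le 1$. We set
$P_{\Omega^c}(Q_A) = \gamma\,\mathrm{sign}(P_{\Omega^c}(N_A))$ so that $\|P_{\Omega^c}(Q_A)\|_\infty = \gamma$ and
$\langle P_{\Omega^c}(Q_A), P_{\Omega^c}(N_A) \rangle = \gamma\|P_{\Omega^c}(N_A)\|_1$.
Letting $P_{T^\perp}(N_B) = \tilde{U}\tilde{\Sigma}\tilde{V}^T$ be the singular value decomposition of $P_{T^\perp}(N_B)$, we set
$P_{T^\perp}(Q_B) = \tilde{U}\tilde{V}^T$ so that $\|P_{T^\perp}(Q_B)\| = 1$ and $\langle P_{T^\perp}(Q_B), P_{T^\perp}(N_B) \rangle = \|P_{T^\perp}(N_B)\|_*$.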
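
The two selections above rest on standard dual-norm pairings. The short NumPy sketch below is an illustration of ours, with randomly generated stand-ins `X` and `Y` for $P_{\Omega^c}(N_A)$ and $P_{T^\perp}(N_B)$; it does not construct the actual projections from the proof. It checks that $\gamma\,\mathrm{sign}(X)$ has $\ell_\infty$ norm $\gamma$ and pairs with $X$ to give $\gamma\|X\|_1$, and that the product of the singular-vector factors of $Y$ has unit spectral norm and pairs with $Y$ to give the nuclear norm.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.3
X = rng.standard_normal((5, 5))   # stand-in for the projection of N_A onto Omega^c
Y = rng.standard_normal((5, 5))   # stand-in for the projection of N_B onto T-perp

# Entrywise choice: gamma * sign(X) has ell_inf norm gamma
# and its inner product with X equals gamma * ||X||_1.
QA_part = gamma * np.sign(X)
pair_l1 = np.sum(QA_part * X)

# Spectral choice: with Y = U diag(s) V^T, the matrix U V^T has spectral norm 1
# and its inner product with Y equals the nuclear norm ||Y||_*.
U, s, Vt = np.linalg.svd(Y)
QB_part = U @ Vt
pair_nuc = np.sum(QB_part * Y)

print(pair_l1, gamma * np.abs(X).sum())
print(pair_nuc, s.sum())
```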

With this choice of $(Q_A, Q_B)$, we can simplify (A.6) as follows:

\[
\langle Q_A, N_A \rangle + \langle Q_B, N_B \rangle \ge \left(\gamma - \|P_{\Omega^c}(Q)\|_\infty\right)\|P_{\Omega^c}(N_A)\|_1 + \left(1 - \|P_{T^\perp}(Q)\|\right)\|P_{T^\perp}(N_B)\|_*.
\]

Since $\|P_{\Omega^c}(Q)\|_\infty < \gamma$ and $\|P_{T^\perp}(Q)\| < 1$, we have that $\langle Q_A, N_A \rangle + \langle Q_B, N_B \rangle$ is strictly
positive unless $P_{\Omega^c}(N_A) = 0$ and $P_{T^\perp}(N_B) = 0$. Thus, $\gamma\|A^* + N_A\|_1 + \|B^* + N_B\|_* >
\gamma\|A^*\|_1 + \|B^*\|_*$ if $P_{\Omega^c}(N_A) \neq 0$ or $P_{T^\perp}(N_B) \neq 0$. However, if $P_{\Omega^c}(N_A) = P_{T^\perp}(N_B) = 0$,
then $P_\Omega(N_A) + P_T(N_B) = 0$ because we also have that $N_A + N_B = 0$. In other
words, $P_\Omega(N_A) = -P_T(N_B)$. This can only be possible if $P_\Omega(N_A) = P_T(N_B) = 0$ (as
$\Omega \cap T = \{0\}$), which in turn implies that $N_A = N_B = 0$. Therefore, $\gamma\|A^* + N_A\|_1 +
\|B^* + N_B\|_* > \gamma\|A^*\|_1 + \|B^*\|_*$ unless $N_A = N_B = 0$. □


Proof  of  Theorem  3.4.1 


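As with the previous proof, we avoid cluttered notation by letting $\Omega = \Omega(A^*)$, $T =
T(B^*)$, $\Omega^c(A^*) = \Omega^c$, and $T^\perp(B^*) = T^\perp$. One can check that

\[
\xi(B^*)\mu(A^*) < \frac{1}{6} \;\Longrightarrow\; \frac{\xi(B^*)}{1 - 4\,\xi(B^*)\mu(A^*)} < \frac{1 - 3\,\xi(B^*)\mu(A^*)}{\mu(A^*)}. \tag{A.7}
\]

Thus, we show that if $\xi(B^*)\mu(A^*) < \frac{1}{6}$ then there exists a range of $\gamma$ for which a dual
$Q$ with the requisite properties exists. Also note that plugging in $\xi(B^*)\mu(A^*) = \frac{1}{6}$ in
the above range gives the strictly smaller range $\left[3\xi(B^*), \frac{1}{2\mu(A^*)}\right]$ for $\gamma$; for any choice of
$p \in [0, 1]$ we have that $\gamma = \frac{(3\xi(B^*))^p}{(2\mu(A^*))^{1-p}}$ is always within the above range.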
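
A small plain-Python sketch of this range may help fix ideas. The function names `gamma_range` and `gamma_choice` and the numerical values of `xi` and `mu` are illustrative assumptions of ours, not quantities from the thesis. The sketch evaluates the two endpoints of the range and checks that the explicit choice of $\gamma$ stays inside it for several values of $p$.

```python
def gamma_range(xi, mu):
    """Endpoints of the range of gamma; requires xi * mu < 1/6."""
    prod = xi * mu
    if not prod < 1.0 / 6.0:
        raise ValueError("need xi * mu < 1/6")
    lower = xi / (1.0 - 4.0 * prod)
    upper = (1.0 - 3.0 * prod) / mu
    return lower, upper

def gamma_choice(xi, mu, p):
    """The explicit choice gamma = (3 xi)^p / (2 mu)^(1 - p) for p in [0, 1]."""
    return (3.0 * xi) ** p / (2.0 * mu) ** (1.0 - p)

xi, mu = 0.05, 2.0            # illustrative values with xi * mu = 0.1 < 1/6
lower, upper = gamma_range(xi, mu)
choices = [gamma_choice(xi, mu, k / 10.0) for k in range(11)]
print(lower, upper, min(choices), max(choices))
```

For these values the explicit choices interpolate geometrically between $3\xi$ and $1/(2\mu)$, both of which lie strictly inside the range.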

We aim to construct a dual $Q$ by considering candidates in the direct sum $\Omega \oplus T$ of
the tangent spaces. Since $\mu(A^*)\xi(B^*) < \frac{1}{6}$, we can conclude from Proposition 3.3.1 that
there exists a unique $Q \in \Omega \oplus T$ such that $P_\Omega(Q) = \gamma\,\mathrm{sign}(A^*)$ and $P_T(Q) = UV'$ (recall
that these are conditions that a dual must satisfy according to Proposition 3.4.1), as
$\Omega \cap T = \{0\}$. The rest of this proof shows that if $\mu(A^*)\xi(B^*) < \frac{1}{6}$ then the projections
of such a $Q$ onto $T^\perp$ and onto $\Omega^c$ will be small, i.e., we show that $\|P_{\Omega^c}(Q)\|_\infty < \gamma$ and
$\|P_{T^\perp}(Q)\| < 1$.

We note here that $Q$ can be uniquely expressed as the sum of an element of $T$ and
an element of $\Omega$, i.e., $Q = Q_\Omega + Q_T$ with $Q_\Omega \in \Omega$ and $Q_T \in T$. The uniqueness of
the splitting can be concluded because $\Omega \cap T = \{0\}$. Let $Q_\Omega = \gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega$ and
$Q_T = UV' + \epsilon_T$. We then have

\[
P_\Omega(Q) = \gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega + P_\Omega(Q_T) = \gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega + P_\Omega(UV' + \epsilon_T).
\]
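
Since $P_\Omega(Q) = \gamma\,\mathrm{sign}(A^*)$,

\[
\epsilon_\Omega = -P_\Omega(UV' + \epsilon_T). \tag{A.8}
\]

Similarly,

\[
\epsilon_T = -P_T(\gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega). \tag{A.9}
\]

Next, we obtain the following bound on $\|P_{\Omega^c}(Q)\|_\infty$:

\[
\begin{aligned}
\|P_{\Omega^c}(Q)\|_\infty &= \|P_{\Omega^c}(UV' + \epsilon_T)\|_\infty \\
&\le \|UV' + \epsilon_T\|_\infty \\
&\le \xi(B^*)\,\|UV' + \epsilon_T\| \\
&\le \xi(B^*)\,(1 + \|\epsilon_T\|),
\end{aligned}
\tag{A.10}
\]

where we obtain the second inequality based on the definition of $\xi(B^*)$ (since $UV' + \epsilon_T \in T$).
Similarly, we can obtain the following bound on $\|P_{T^\perp}(Q)\|$:

\[
\begin{aligned}
\|P_{T^\perp}(Q)\| &= \|P_{T^\perp}(\gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega)\| \\
&\le \|\gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega\| \\
&\le \mu(A^*)\,\|\gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega\|_\infty \\
&\le \mu(A^*)\,(\gamma + \|\epsilon_\Omega\|_\infty),
\end{aligned}
\tag{A.11}
\]

where we obtain the second inequality based on the definition of $\mu(A^*)$ (since $\gamma\,\mathrm{sign}(A^*) +
\epsilon_\Omega \in \Omega$). Thus, we can bound $\|P_{\Omega^c}(Q)\|_\infty$ and $\|P_{T^\perp}(Q)\|$ by bounding $\|\epsilon_T\|$ and $\|\epsilon_\Omega\|_\infty$
respectively (using the relations (A.9) and (A.8)).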

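By definition of $\xi(B^*)$ and using (A.8),

\[
\begin{aligned}
\|\epsilon_\Omega\|_\infty &= \|P_\Omega(UV' + \epsilon_T)\|_\infty \\
&\le \|UV' + \epsilon_T\|_\infty \\
&\le \xi(B^*)\,\|UV' + \epsilon_T\| \\
&\le \xi(B^*)\,(1 + \|\epsilon_T\|),
\end{aligned}
\tag{A.12}
\]

where the second inequality is obtained because $UV' + \epsilon_T \in T$. Similarly, by definition
of $\mu(A^*)$ and using (A.9),

\[
\begin{aligned}
\|\epsilon_T\| &= \|P_T(\gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega)\| \\
&\le 2\,\|\gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega\| \\
&\le 2\mu(A^*)\,\|\gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega\|_\infty \\
&\le 2\mu(A^*)\,(\gamma + \|\epsilon_\Omega\|_\infty),
\end{aligned}
\tag{A.13}
\]

where the first inequality is obtained because $\|P_T(M)\| \le 2\|M\|$, and the second
inequality is obtained because $\gamma\,\mathrm{sign}(A^*) + \epsilon_\Omega \in \Omega$.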

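Putting (A.12) in (A.13), we have that

\[
\begin{aligned}
\|\epsilon_T\| &\le 2\mu(A^*)\left(\gamma + \xi(B^*)(1 + \|\epsilon_T\|)\right) \\
\Rightarrow\ \|\epsilon_T\| &\le \frac{2\gamma\mu(A^*) + 2\xi(B^*)\mu(A^*)}{1 - 2\xi(B^*)\mu(A^*)}.
\end{aligned}
\tag{A.14}
\]

Similarly, putting (A.13) in (A.12), we have that

\[
\begin{aligned}
\|\epsilon_\Omega\|_\infty &\le \xi(B^*)\left(1 + 2\mu(A^*)(\gamma + \|\epsilon_\Omega\|_\infty)\right) \\
\Rightarrow\ \|\epsilon_\Omega\|_\infty &\le \frac{\xi(B^*) + 2\gamma\xi(B^*)\mu(A^*)}{1 - 2\xi(B^*)\mu(A^*)}.
\end{aligned}
\tag{A.15}
\]

We now show that $\|P_{T^\perp}(Q)\| < 1$. Combining (A.15) and (A.11),

\[
\begin{aligned}
\|P_{T^\perp}(Q)\| &\le \mu(A^*)\left(\gamma + \frac{\xi(B^*) + 2\gamma\xi(B^*)\mu(A^*)}{1 - 2\xi(B^*)\mu(A^*)}\right) \\
&= \mu(A^*)\left(\frac{\gamma + \xi(B^*)}{1 - 2\xi(B^*)\mu(A^*)}\right) \\
&< \mu(A^*)\left(\frac{\frac{1 - 3\xi(B^*)\mu(A^*)}{\mu(A^*)} + \xi(B^*)}{1 - 2\xi(B^*)\mu(A^*)}\right) \\
&= 1,
\end{aligned}
\]

since $\gamma < \frac{1 - 3\xi(B^*)\mu(A^*)}{\mu(A^*)}$ by assumption.

Finally, we show that $\|P_{\Omega^c}(Q)\|_\infty < \gamma$. Combining (A.14) and (A.10),

\[
\begin{aligned}
\|P_{\Omega^c}(Q)\|_\infty &\le \xi(B^*)\left(1 + \frac{2\gamma\mu(A^*) + 2\xi(B^*)\mu(A^*)}{1 - 2\xi(B^*)\mu(A^*)}\right) \\
&= \xi(B^*)\left(\frac{1 + 2\gamma\mu(A^*)}{1 - 2\xi(B^*)\mu(A^*)}\right) \\
&= \left[\xi(B^*)\left(\frac{1 + 2\gamma\mu(A^*)}{1 - 2\xi(B^*)\mu(A^*)}\right) - \gamma\right] + \gamma \\
&= \left[\frac{\xi(B^*) + 2\gamma\xi(B^*)\mu(A^*) - \gamma + 2\gamma\xi(B^*)\mu(A^*)}{1 - 2\xi(B^*)\mu(A^*)}\right] + \gamma \\
&= \left[\frac{\xi(B^*) - \gamma\left(1 - 4\xi(B^*)\mu(A^*)\right)}{1 - 2\xi(B^*)\mu(A^*)}\right] + \gamma \\
&< \left[\frac{\xi(B^*) - \xi(B^*)}{1 - 2\xi(B^*)\mu(A^*)}\right] + \gamma \\
&= \gamma.
\end{aligned}
\]

Here, we used the fact that $\frac{\xi(B^*)}{1 - 4\xi(B^*)\mu(A^*)} < \gamma$ in the second inequality. □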

Proof of Proposition 3.4.2

Based on the Perron-Frobenius theorem [82], one can conclude that $\|P\| \ge \|Q\|$ if
$P_{i,j} \ge |Q_{i,j}|\ \forall\, i,j$. Thus, we need only consider the matrix that has 1 in every location
in the support set $\Omega(A)$ and 0 everywhere else. Based on the definition of the spectral
norm, we can re-write $\mu(A)$ as follows:

\[
\mu(A) = \max_{\|x\|_2 = 1,\ \|y\|_2 = 1}\ \sum_{(i,j) \in \Omega(A)} x_i y_j. \tag{A.16}
\]

Upper bound. For any matrix $M$, we have from the results in [130] that

\[
\|M\|^2 \le \max_{i,j}\ r_i c_j, \tag{A.17}
\]

where $r_i = \sum_k |M_{i,k}|$ denotes the absolute row-sum of row $i$ and $c_j = \sum_k |M_{k,j}|$ denotes
the absolute column-sum of column $j$. Let $M^{\Omega(A)}$ be a matrix defined as follows:

\[
M^{\Omega(A)}_{i,j} = \begin{cases} 1, & (i,j) \in \Omega(A) \\ 0, & \text{otherwise.} \end{cases}
\]

Based on the reformulation of $\mu(A)$ above (A.16), it is clear that

\[
\mu(A) = \|M^{\Omega(A)}\|.
\]

From the bound (A.17), we have that

\[
\|M^{\Omega(A)}\| \le \mathrm{deg}_{\max}(A).
\]


Lower bound. Now suppose that each row/column of $A$ has at least $\mathrm{deg}_{\min}(A)$ nonzero
entries. Using the reformulation (A.16) of $\mu(A)$ above, we have that

\[
\mu(A) \ge \sum_{(i,j) \in \Omega(A)} \frac{1}{\sqrt{n}} \frac{1}{\sqrt{n}} = \frac{|\mathrm{support}(A)|}{n} \ge \mathrm{deg}_{\min}(A).
\]

Here we set $x = y = \frac{1}{\sqrt{n}} \mathbf{1}$, with $\mathbf{1}$ representing the all-ones vector, as candidates in the
optimization problem (A.16). □
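
The two bounds can be checked numerically on a small example. The sketch below assumes NumPy; the helper name `mu_of_support` and the particular matrix are our own illustrative choices. Following the reformulation (A.16), it computes $\mu(A)$ as the spectral norm of the 0/1 indicator of the support and compares it with the minimum and maximum number of nonzeros per row or column.

```python
import numpy as np

def mu_of_support(A):
    """mu(A) as the spectral norm of the 0/1 indicator of the support of A."""
    indicator = (A != 0).astype(float)
    return np.linalg.norm(indicator, 2), indicator

# Illustrative sparse matrix: every row and column has 2 or 3 nonzero entries.
A = np.array([
    [4.0, -1.0, 0.0,  0.0, 0.0],
    [0.0,  2.0, 3.0,  0.0, 0.0],
    [0.0,  0.0, 1.0, -2.0, 0.0],
    [0.0,  0.0, 0.0,  5.0, 1.0],
    [7.0,  0.0, 1.0,  0.0, 2.0],
])

mu, indicator = mu_of_support(A)
row_deg = indicator.sum(axis=1)
col_deg = indicator.sum(axis=0)
deg_min = min(row_deg.min(), col_deg.min())
deg_max = max(row_deg.max(), col_deg.max())
avg_deg = indicator.sum() / A.shape[0]   # |support(A)| / n from the lower-bound step
print(deg_min, avg_deg, mu, deg_max)
```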


Proof  of  Proposition  3.4.3 

Let $B = U\Sigma V^T$ be the SVD of $B$.

Upper bound. We can upper-bound $\xi(B)$ as follows:

\[
\begin{aligned}
\xi(B) &= \max_{M \in T(B),\ \|M\| \le 1} \|M\|_\infty \\
&= \max_{M \in T(B),\ \|M\| \le 1} \|P_{T(B)}(M)\|_\infty \\
&\le \max_{\|M\| \le 1} \|P_{T(B)}(M)\|_\infty \\
&\le \max_{M\ \text{orthogonal}} \|P_{T(B)}(M)\|_\infty \\
&\le \max_{M\ \text{orthogonal}} \|P_U M\|_\infty + \max_{M\ \text{orthogonal}} \|(I_{n \times n} - P_U) M P_V\|_\infty.
\end{aligned}
\]

For the second inequality, we have used the fact that the maximum of a convex function
over a convex set is achieved at one of the extreme points of the constraint set. The
orthogonal matrices are the extreme points of the set of contractions (i.e., matrices with
spectral norm $\le 1$). Note that for the non-square case we would need to consider partial
isometries; the rest of the proof remains unchanged. We have used $P_{T(B)}(M) = P_U M +
M P_V - P_U M P_V$ from (3.8) in the last inequality, where $P_U = UU^T$ and $P_V = VV^T$
denote the projections onto the spaces spanned by $U$ and $V$ respectively.

We have the following simple bound for $\|P_U M\|_\infty$ with $M$ orthogonal:

\[
\begin{aligned}
\max_{M\ \text{orthogonal}} \|P_U M\|_\infty &= \max_{M\ \text{orthogonal}}\ \max_{i,j}\ e_i^T P_U M e_j \\
&\le \max_{M\ \text{orthogonal}}\ \max_{i,j}\ \|P_U e_i\|_2\, \|M e_j\|_2 \\
&= \max_i \|P_U e_i\|_2 \times \max_{M\ \text{orthogonal}}\ \max_j \|M e_j\|_2 \\
&= \beta(U).
\end{aligned}
\tag{A.18}
\]

Here we used the Cauchy-Schwarz inequality in the second line, and the definition of
$\beta$ from (3.13) in the last line.

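Similarly, we have that

\[
\begin{aligned}
\max_{M\ \text{orthogonal}} \|(I_{n \times n} - P_U) M P_V\|_\infty
&= \max_{M\ \text{orthogonal}}\ \max_{i,j}\ e_i^T (I_{n \times n} - P_U) M P_V e_j \\
&\le \max_{M\ \text{orthogonal}}\ \max_{i,j}\ \|(I_{n \times n} - P_U) e_i\|_2\, \|M P_V e_j\|_2 \\
&= \max_i \|(I_{n \times n} - P_U) e_i\|_2 \times \max_{M\ \text{orthogonal}}\ \max_j \|M P_V e_j\|_2 \\
&\le 1 \times \max_j \|P_V e_j\|_2 \\
&= \beta(V).
\end{aligned}
\tag{A.19}
\]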

Using the definition of $\mathrm{inc}(B)$ from (3.14) along with (A.18) and (A.19), we have
that

\[
\xi(B) \le \beta(U) + \beta(V) \le 2\,\mathrm{inc}(B).
\]

Lower bound. Next we prove a lower bound on $\xi(B)$. Recall the definition of the tangent
space $T(B)$ from (3.7). We restrict our attention to elements of the tangent space $T(B)$
of the form $P_U M = UU^T M$ for $M$ orthogonal (an analogous argument follows for
elements of the form $M P_V$ for $M$ orthogonal). One can check that

\[
\|P_U M\| = \max_{\|x\|_2 = 1,\ \|y\|_2 = 1} x^T P_U M y \le \max_{\|x\|_2 = 1} \|P_U x\|_2\ \max_{\|y\|_2 = 1} \|M y\|_2 \le 1.
\]

Therefore,

\[
\xi(B) \ge \max_{M\ \text{orthogonal}} \|P_U M\|_\infty.
\]

Thus, we only need to show that the inequality in line (2) of (A.18) is achieved by
some orthogonal matrix $M$ in order to conclude that $\xi(B) \ge \beta(U)$. Define the “most
aligned” basis vector with the subspace $U$ as follows:

\[
i^* = \arg\max_i \|P_U e_i\|_2.
\]

Let $M$ be any orthogonal matrix with one of its columns equal to $\frac{1}{\beta(U)} P_U e_{i^*}$, i.e., a
normalized version of the projection onto $U$ of the most aligned basis vector. One can check
that such an orthogonal matrix achieves equality in line (2) of (A.18). Consequently, we
have that

\[
\xi(B) \ge \max_{M\ \text{orthogonal}} \|P_U M\|_\infty = \beta(U).
\]

By a similar argument with respect to $V$, we have the lower bound as claimed in the
proposition. □
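
The quantities in this proof are easy to evaluate numerically. The NumPy sketch below is an illustration of ours; the dimensions, the random seed, and the helper names `beta` and `project_tangent` are assumptions. It computes $\beta(U)$ and $\beta(V)$ for a random low-rank $B$, evaluates the entrywise maximum of a random unit-spectral-norm element of $T(B)$, and builds a rank-one element of $T(B)$ from the normalized projection of the most aligned basis vector whose largest entry equals $\beta(U)$. It uses a rank-one witness rather than the orthogonal matrix from the proof, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 8, 2
B = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

U_full, s, Vt_full = np.linalg.svd(B)
U, V = U_full[:, :r], Vt_full[:r, :].T
P_U, P_V = U @ U.T, V @ V.T

def beta(P):
    """Largest Euclidean norm of a projected standard basis vector P e_i."""
    return np.linalg.norm(P, axis=0).max()

beta_U, beta_V = beta(P_U), beta(P_V)
inc = max(beta_U, beta_V)

def project_tangent(M):
    """P_T(M) = P_U M + M P_V - P_U M P_V, as in (3.8)."""
    return P_U @ M + M @ P_V - P_U @ M @ P_V

# A random element of T(B), scaled to spectral norm 1.
N = project_tangent(rng.standard_normal((n, n)))
N /= np.linalg.norm(N, 2)
sample_value = np.abs(N).max()

# Rank-one element of T(B) built from the most aligned basis vector.
i_star = np.argmax(np.linalg.norm(P_U, axis=0))
w = P_U[:, i_star] / beta_U
witness = np.outer(w, np.eye(n)[0])
witness_value = np.abs(witness).max()
print(beta_U, beta_V, sample_value, witness_value)
```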

Appendix B


Proofs  of  Chapter  4 


■ B.1 Matrix Perturbation Bounds

Given  a  low-rank  matrix  we  consider  what  happens  to  the  invariant  subspaces  when  the 
matrix  is  perturbed  by  a  small  amount.  We  assume  without  loss  of  generality  that  the 
matrix  under  consideration  is  square  and  symmetric,  and  our  methods  can  be  extended 
to  the  general  non-symmetric  non-square  case.  We  refer  the  interested  reader  to  [7, 87] 
for  more  details,  as  the  results  presented  here  are  only  a  brief  summary  of  what  is 
relevant  for  this  Appendix.  In  particular  the  arguments  presented  here  are  along  the 
lines  of  those  presented  in  [7].  The  appendices  in  [7]  also  provide  a  more  refined  analysis 
of  second-order  perturbation  errors. 

The resolvent of a matrix $M$ is given by $(M - \zeta I)^{-1}$ [87], and it is well-defined for
all $\zeta \in \mathbb{C}$ that do not coincide with an eigenvalue of $M$. If $M$ has no eigenvalue with
magnitude equal to $\eta$, then we have by the Cauchy residue formula that the projector
onto the invariant subspace of a matrix $M$ corresponding to all singular values smaller
than $\eta$ is given by

\[
P_{M,\eta} = \frac{-1}{2\pi i} \oint_{\mathcal{C}_\eta} (M - \zeta I)^{-1}\, d\zeta, \tag{B.1}
\]

where $\mathcal{C}_\eta$ denotes the positively-oriented circle of radius $\eta$ centered at the origin.
Similarly, we have that the weighted projection onto the smallest singular values is given
by

\[
P^w_{M,\eta} = M P_{M,\eta} = \frac{-1}{2\pi i} \oint_{\mathcal{C}_\eta} \zeta\, (M - \zeta I)^{-1}\, d\zeta. \tag{B.2}
\]
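
The contour formulas (B.1) and (B.2) can be checked numerically by discretizing the circle. The sketch below assumes NumPy; the function name `contour_projector`, the number of quadrature points, and the test matrix are illustrative assumptions of ours. It applies the trapezoid rule on $|\zeta| = \eta$ to a symmetric matrix with eigenvalues $3, 2, 0, 0$ and compares the result with the projector onto the null space computed from the eigenvectors.

```python
import numpy as np

def contour_projector(M, eta, num_points=200):
    """Approximate (B.1) and (B.2) by the trapezoid rule on the circle |zeta| = eta."""
    n = M.shape[0]
    identity = np.eye(n)
    P = np.zeros((n, n), dtype=complex)
    Pw = np.zeros((n, n), dtype=complex)
    for k in range(num_points):
        zeta = eta * np.exp(2j * np.pi * k / num_points)
        dzeta = 1j * zeta * (2.0 * np.pi / num_points)   # d(zeta) along the circle
        resolvent = np.linalg.inv(M - zeta * identity)
        P += resolvent * dzeta
        Pw += zeta * resolvent * dzeta
    scale = -1.0 / (2j * np.pi)
    return (scale * P).real, (scale * Pw).real

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
M = Q @ np.diag([3.0, 2.0, 0.0, 0.0]) @ Q.T      # symmetric, rank 2

P_contour, Pw_contour = contour_projector(M, eta=1.0)
P_exact = Q[:, 2:] @ Q[:, 2:].T                   # projector onto the null space of M
print(np.abs(P_contour - P_exact).max(), np.abs(Pw_contour).max())
```

Since the eigenvalues inside the circle are exactly zero here, the weighted projection vanishes, which is the property used later for exactly low-rank matrices.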

Suppose that $M$ is a low-rank matrix with smallest non-zero singular value $\sigma$, and
let $\Delta$ be a perturbation of $M$ such that $\|\Delta\|_2 \le \kappa < \frac{\sigma}{2}$. We have the following identity
for any $|\zeta| = \kappa$, which will be used repeatedly:

\[
[(M + \Delta) - \zeta I]^{-1} - [M - \zeta I]^{-1} = -[M - \zeta I]^{-1} \Delta\, [(M + \Delta) - \zeta I]^{-1}. \tag{B.3}
\]
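
We then have that

\[
\begin{aligned}
P_{M+\Delta,\kappa} - P_{M,\kappa} &= \frac{-1}{2\pi i} \oint_{\mathcal{C}_\kappa} \left\{ [(M + \Delta) - \zeta I]^{-1} - [M - \zeta I]^{-1} \right\} d\zeta \\
&= \frac{1}{2\pi i} \oint_{\mathcal{C}_\kappa} [M - \zeta I]^{-1} \Delta\, [(M + \Delta) - \zeta I]^{-1}\, d\zeta.
\end{aligned}
\tag{B.4}
\]

Similarly, we have the following for $P^w_{M,\kappa}$:

\[
\begin{aligned}
P^w_{M+\Delta,\kappa} - P^w_{M,\kappa} &= \frac{-1}{2\pi i} \oint_{\mathcal{C}_\kappa} \zeta \left\{ [(M + \Delta) - \zeta I]^{-1} - [M - \zeta I]^{-1} \right\} d\zeta \\
&= \frac{1}{2\pi i} \oint_{\mathcal{C}_\kappa} \zeta \left\{ [M - \zeta I]^{-1} \Delta\, [(M + \Delta) - \zeta I]^{-1} \right\} d\zeta \\
&= \frac{1}{2\pi i} \oint_{\mathcal{C}_\kappa} \zeta\, [M - \zeta I]^{-1} \Delta\, [M - \zeta I]^{-1}\, d\zeta \\
&\quad - \frac{1}{2\pi i} \oint_{\mathcal{C}_\kappa} \zeta\, [M - \zeta I]^{-1} \Delta\, [M - \zeta I]^{-1} \Delta\, [(M + \Delta) - \zeta I]^{-1}\, d\zeta.
\end{aligned}
\tag{B.5}
\]

Given these expressions, we have the following two results.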


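Proposition B.1.1. Let $M \in \mathbb{R}^{p \times p}$ be a rank-$r$ matrix with smallest non-zero singular
value equal to $\sigma$, and let $\Delta$ be a perturbation to $M$ such that $\|\Delta\|_2 \le \frac{\kappa}{2}$ with $\kappa < \frac{\sigma}{2}$.
Then we have that

\[
\|P_{M+\Delta,\kappa} - P_{M,\kappa}\|_2 \le \frac{\kappa}{(\sigma - \kappa)\left(\sigma - \frac{3\kappa}{2}\right)}\, \|\Delta\|_2.
\]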


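Proof: This result follows directly from the expression (B.4), and the sub-multiplicative
property of the spectral norm:

\[
\begin{aligned}
\|P_{M+\Delta,\kappa} - P_{M,\kappa}\|_2 &\le \frac{1}{2\pi}\, 2\pi\kappa\, \frac{1}{\sigma - \kappa}\, \frac{1}{\sigma - \frac{3\kappa}{2}}\, \|\Delta\|_2 \\
&= \frac{\kappa}{(\sigma - \kappa)\left(\sigma - \frac{3\kappa}{2}\right)}\, \|\Delta\|_2.
\end{aligned}
\]

Here, we used the fact that $\|[M - \zeta I]^{-1}\|_2 \le \frac{1}{\sigma - \kappa}$ and $\|[(M + \Delta) - \zeta I]^{-1}\|_2 \le \frac{1}{\sigma - \frac{3\kappa}{2}}$ for
$|\zeta| = \kappa$. □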

Next, we develop a similar bound for $P^w_{M,\kappa}$. Let $U(M)$ denote the invariant subspace of
$M$ corresponding to the non-zero singular values, and let $P_{U(M)}$ denote the projector
onto this subspace.
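
Proposition B.1.2. Let $M \in \mathbb{R}^{p \times p}$ be a rank-$r$ matrix with smallest non-zero singular
value equal to $\sigma$, and let $\Delta$ be a perturbation to $M$ such that $\|\Delta\|_2 \le \frac{\kappa}{2}$ with $\kappa < \frac{\sigma}{2}$.
Then we have that

\[
\left\|P^w_{M+\Delta,\kappa} - P^w_{M,\kappa} - (I - P_{U(M)})\, \Delta\, (I - P_{U(M)})\right\|_2 \le \frac{\kappa^2}{(\sigma - \kappa)^2\left(\sigma - \frac{3\kappa}{2}\right)}\, \|\Delta\|_2^2.
\]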

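Proof: One can check that

\[
\frac{1}{2\pi i} \oint_{\mathcal{C}_\kappa} \zeta\, [M - \zeta I]^{-1} \Delta\, [M - \zeta I]^{-1}\, d\zeta = (I - P_{U(M)})\, \Delta\, (I - P_{U(M)}).
\]

Next we use the expression (B.5), and the sub-multiplicative property of the spectral
norm:

\[
\begin{aligned}
\left\|P^w_{M+\Delta,\kappa} - P^w_{M,\kappa} - (I - P_{U(M)})\, \Delta\, (I - P_{U(M)})\right\|_2
&\le \frac{1}{2\pi}\, 2\pi\kappa\ \kappa\ \frac{1}{\sigma - \kappa}\, \frac{1}{\sigma - \kappa}\, \frac{1}{\sigma - \frac{3\kappa}{2}}\, \|\Delta\|_2^2 \\
&= \frac{\kappa^2}{(\sigma - \kappa)^2\left(\sigma - \frac{3\kappa}{2}\right)}\, \|\Delta\|_2^2.
\end{aligned}
\]

As with the previous proof, we used the fact that $\|[M - \zeta I]^{-1}\|_2 \le \frac{1}{\sigma - \kappa}$ and
$\|[(M + \Delta) - \zeta I]^{-1}\|_2 \le \frac{1}{\sigma - \frac{3\kappa}{2}}$ for $|\zeta| = \kappa$. □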

We  will  use  these  expressions  to  derive  bounds  on  the  “twisting”  between  the  tangent 
spaces  at  M  and  at  M  +  A  with  respect  to  the  rank  variety. 


■  B.2  Curvature  of  Rank  Variety 

For a symmetric rank-$r$ matrix $M$, the projection onto the tangent space $T(M)$
(restricted to the variety of symmetric matrices with rank less than or equal to $r$) can be
written in terms of the projection $P_{U(M)}$ onto the row space $U(M)$. For any matrix $N$

\[
\mathcal{P}_{T(M)}(N) = P_{U(M)} N + N P_{U(M)} - P_{U(M)} N P_{U(M)}.
\]

One can then check that the projection onto the normal space $T(M)^\perp$ is given by

\[
\mathcal{P}_{T(M)^\perp}(N) = [I - \mathcal{P}_{T(M)}](N) = (I - P_{U(M)})\, N\, (I - P_{U(M)}).
\]
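
These two formulas translate directly into code. The NumPy sketch below is our own illustration; the sizes, the seed, and the function names `proj_tangent` and `proj_normal` are assumptions. It builds a random symmetric rank-$r$ matrix, forms the two projections, and checks that they are complementary, that the tangent projection is idempotent, and that $M$ itself has no component in the normal space.

```python
import numpy as np

rng = np.random.default_rng(4)
p, r = 6, 2

# Symmetric rank-r matrix M and the projector onto its row space U(M).
G = rng.standard_normal((p, r))
M = G @ G.T
eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
U = eigvecs[:, -r:]                       # eigenvectors of the r nonzero eigenvalues
P_U = U @ U.T
I = np.eye(p)

def proj_tangent(N):
    """Projection onto T(M): P_U N + N P_U - P_U N P_U."""
    return P_U @ N + N @ P_U - P_U @ N @ P_U

def proj_normal(N):
    """Projection onto the normal space: (I - P_U) N (I - P_U)."""
    return (I - P_U) @ N @ (I - P_U)

N = rng.standard_normal((p, p))
N = (N + N.T) / 2.0
T_part, normal_part = proj_tangent(N), proj_normal(N)
inner = np.sum(T_part * normal_part)      # trace inner product of the two parts
print(inner, np.abs(proj_normal(M)).max())
```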

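Proof of Proposition 4.2.1: For any matrix $N$, we have that

\[
[\mathcal{P}_{T(M+\Delta)} - \mathcal{P}_{T(M)}](N) = [P_{U(M+\Delta)} - P_{U(M)}]\, N\, [I - P_{U(M)}] + [I - P_{U(M+\Delta)}]\, N\, [P_{U(M+\Delta)} - P_{U(M)}].
\]

Further, we note that for $\kappa < \frac{\sigma}{2}$

\[
\begin{aligned}
P_{U(M+\Delta)} - P_{U(M)} &= [I - P_{U(M)}] - [I - P_{U(M+\Delta)}] \\
&= P_{M,\kappa} - P_{M+\Delta,\kappa},
\end{aligned}
\]

where $P_{M,\kappa}$ is defined in the previous section. Thus, we have the following sequence of
inequalities for $\kappa = \frac{\sigma}{4}$:

\[
\begin{aligned}
\rho(T(M + \Delta), T(M)) &= \max_{\|N\|_2 \le 1} \left\| [P_{U(M+\Delta)} - P_{U(M)}]\, N\, [I - P_{U(M)}] + [I - P_{U(M+\Delta)}]\, N\, [P_{U(M+\Delta)} - P_{U(M)}] \right\|_2 \\
&\le \max_{\|N\|_2 \le 1} \left\| [P_{U(M+\Delta)} - P_{U(M)}]\, N\, [I - P_{U(M)}] \right\|_2 + \max_{\|N\|_2 \le 1} \left\| [I - P_{U(M+\Delta)}]\, N\, [P_{U(M+\Delta)} - P_{U(M)}] \right\|_2 \\
&\le 2\, \|P_{M+\Delta,\frac{\sigma}{4}} - P_{M,\frac{\sigma}{4}}\|_2 \\
&\le \frac{2}{\sigma}\, \|\Delta\|_2,
\end{aligned}
\]

where we obtain the last inequality from Proposition B.1.1. □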

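Proof of Proposition 4.2.2: Since both $M$ and $M + \Delta$ are rank-$r$ matrices, we
have that $P^w_{M+\Delta,\kappa} = P^w_{M,\kappa} = 0$. Consequently,

\[
\begin{aligned}
\|\mathcal{P}_{T(M)^\perp}(\Delta)\|_2 &= \|(I - P_{U(M)})\, \Delta\, (I - P_{U(M)})\|_2 \\
&\le \frac{\|\Delta\|_2^2}{\sigma},
\end{aligned}
\]

where we obtain the last inequality from Proposition B.1.2 with $\kappa = \frac{\sigma}{4}$. □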

Proof of Lemma 4.3.1: Since $\rho(T_1, T_2) < 1$ one can check that the largest principal
angle between $T_1$ and $T_2$ is strictly less than $\frac{\pi}{2}$. Consequently, the mapping $\mathcal{P}_{T_2} : T_1 \to
T_2$ restricted to $T_1$ is bijective (as it is injective, and the spaces $T_1, T_2$ have the same
dimension). Consider the maximum and minimum gain of the operator $\mathcal{P}_{T_2}$ restricted
to $T_1$; for any $M \in T_1$, $\|M\|_2 = 1$:

\[
\begin{aligned}
\|\mathcal{P}_{T_2}(M)\|_2 &= \|M + [\mathcal{P}_{T_2} - \mathcal{P}_{T_1}](M)\|_2 \\
&\in [1 - \rho(T_1, T_2),\ 1 + \rho(T_1, T_2)].
\end{aligned}
\]
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Therefore,  we  can  rewrite  £ (T2)  as  follows: 


zm  = 


< 

< 


max 

JVeT2,||iV||2<i 

max  1 1  TVs,  (N)  II 00 

JVeT2l||iV||2<i 

max  ||^r2(Ar)||c 


JVeTi, 


max 

< 


+  II  [^Ti 


2S 


l-p(Ti,T2) 


^T2](iV)||oo] 


< 


< 

< 


1 


1  -  p(Ti,T2) 

1 

1  -  p(Ti,T2) 

1 

1  -  p(Ti,T2) 


f(Tl)  +  «rS,aS1 


Sffi)  +  max  ||[pn-PTl](JV)||2 
Mv  h<i 


Km)  +  p(Ti,t2)]. 


This  concludes  the  proof  of  the  lemma.  □ 


■  B.3  Transversality  and  Identifiability 

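Proof of Lemma 4.3.3: We have that $\mathcal{A}^\dagger \mathcal{A}(S,L) = (S + L,\ S + L)$; therefore, $\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathcal{A} \mathcal{P}_\mathcal{Y}(S,L) = (S + \mathcal{P}_\Omega(L),\ \mathcal{P}_T(S) + L)$. We need to bound $\|S + \mathcal{P}_\Omega(L)\|_\infty$ and $\|\mathcal{P}_T(S) + L\|_2$. First, we have

$$\begin{aligned}
\|S + \mathcal{P}_\Omega(L)\|_\infty &\in \big[ \|S\|_\infty - \|\mathcal{P}_\Omega(L)\|_\infty,\ \|S\|_\infty + \|\mathcal{P}_\Omega(L)\|_\infty \big] \\
&\subseteq \big[ \|S\|_\infty - \|L\|_\infty,\ \|S\|_\infty + \|L\|_\infty \big] \\
&\subseteq \big[ \gamma - \xi(T),\ \gamma + \xi(T) \big].
\end{aligned}$$

Similarly, one can check that

$$\begin{aligned}
\|\mathcal{P}_T(S) + L\|_2 &\in \big[ -\|\mathcal{P}_T(S)\|_2 + \|L\|_2,\ \|\mathcal{P}_T(S)\|_2 + \|L\|_2 \big] \\
&\subseteq \big[ 1 - 2\|S\|_2,\ 1 + 2\|S\|_2 \big] \\
&\subseteq \big[ 1 - 2\gamma\mu(\Omega),\ 1 + 2\gamma\mu(\Omega) \big].
\end{aligned}$$

Thus, we can conclude that

$$g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathcal{A} \mathcal{P}_\mathcal{Y}(S,L)) \in [1 - \chi(\Omega,T,\gamma),\ 1 + \chi(\Omega,T,\gamma)],$$

where $\chi(\Omega,T,\gamma)$ is defined in (4.7). □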
Proof of Proposition 4.3.1: Before proving the two parts of this proposition we make a simple observation about $\xi(T')$ using the condition that $\rho(T,T') \le \frac{\xi(T)}{2}$:

$$\xi(T') \le \frac{\xi(T) + \rho(T,T')}{1 - \rho(T,T')} \le \frac{\frac{3\xi(T)}{2}}{1 - \frac{\xi(T)}{2}} \le 3\xi(T).$$

Here we used the property that $\xi(T) \le 1$ in obtaining the final inequality. Consequently, noting that $\gamma \in \left[ \frac{3\beta(2-\nu)\xi(T)}{\nu\alpha},\ \frac{\nu\alpha}{2\beta(2-\nu)\mu(\Omega)} \right]$ implies that

$$\chi(\Omega, T', \gamma) = \max\left\{ \frac{\xi(T')}{\gamma},\ 2\mu(\Omega)\gamma \right\} \le \frac{\nu\alpha}{\beta(2-\nu)}. \tag{B.6}$$


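Part 1: The proof of this step proceeds in a similar manner to that of Lemma 4.3.3. First we have for $S \in \Omega$, $L \in T'$ with $\|S\|_\infty = \gamma$, $\|L\|_2 = 1$:

$$\begin{aligned}
\|\mathcal{P}_\Omega \mathbb{I}^*(S + L)\|_\infty &\ge \|\mathcal{P}_\Omega \mathbb{I}^* S\|_\infty - \|\mathcal{P}_\Omega \mathbb{I}^* L\|_\infty \\
&\ge \alpha\gamma - \|\mathbb{I}^* L\|_\infty \\
&\ge \alpha\gamma - \beta\xi(T').
\end{aligned}$$

Next under the same conditions on $S, L$,

$$\begin{aligned}
\|\mathcal{P}_{T'} \mathbb{I}^*(S + L)\|_2 &\ge \|\mathcal{P}_{T'} \mathbb{I}^* L\|_2 - \|\mathcal{P}_{T'} \mathbb{I}^* S\|_2 \\
&\ge \alpha - 2\|\mathbb{I}^* S\|_2 \\
&\ge \alpha - 2\beta\mu(\Omega)\gamma.
\end{aligned}$$

Combining these last two bounds with (B.6), we conclude that

$$\begin{aligned}
\min_{(S,L) \in \mathcal{Y},\ \|S\|_\infty = \gamma,\ \|L\|_2 = 1} g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(S,L)) &\ge \alpha - \beta \max\left\{ \frac{\xi(T')}{\gamma},\ 2\mu(\Omega)\gamma \right\} \\
&\ge \alpha - \frac{\nu\alpha}{2-\nu} = \frac{2\alpha(1-\nu)}{2-\nu} \\
&\ge \frac{\alpha}{2},
\end{aligned}$$

where the final inequality follows from the assumption that $\nu \in (0, \frac{1}{2}]$.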
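Part 2: Note that for $S \in \Omega$, $L \in T'$ with $\|S\|_\infty \le \gamma$, $\|L\|_2 \le 1$

$$\begin{aligned}
\|\mathcal{P}_{\Omega^\perp} \mathbb{I}^*(S + L)\|_\infty &\le \|\mathcal{P}_{\Omega^\perp} \mathbb{I}^* S\|_\infty + \|\mathcal{P}_{\Omega^\perp} \mathbb{I}^* L\|_\infty \\
&\le \delta\gamma + \beta\xi(T').
\end{aligned}$$

Similarly

$$\begin{aligned}
\|\mathcal{P}_{T'^\perp} \mathbb{I}^*(S + L)\|_2 &\le \|\mathcal{P}_{T'^\perp} \mathbb{I}^* S\|_2 + \|\mathcal{P}_{T'^\perp} \mathbb{I}^* L\|_2 \\
&\le \delta + \beta\gamma\mu(\Omega).
\end{aligned}$$

Combining these last two bounds with the bounds from the first part, we have that

$$\begin{aligned}
\left\| \mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y} \left( \mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y} \right)^{-1} \right\|_{g_\gamma \to g_\gamma} &\le \frac{\delta + \beta \max\left\{ \frac{\xi(T')}{\gamma},\ 2\mu(\Omega)\gamma \right\}}{\alpha - \beta \max\left\{ \frac{\xi(T')}{\gamma},\ 2\mu(\Omega)\gamma \right\}} \\
&\le \frac{\delta + \frac{\nu\alpha}{2-\nu}}{\alpha - \frac{\nu\alpha}{2-\nu}} \\
&\le \frac{(1 - 2\nu)\alpha + \frac{\nu\alpha}{2-\nu}}{\alpha - \frac{\nu\alpha}{2-\nu}} \\
&= 1 - \nu.
\end{aligned}$$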

This  concludes  the  proof  of  the  proposition.  □ 
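The closing algebra in both parts of this proof is easy to confirm with exact rational arithmetic. The sketch below assumes, as the displayed numerator $(1-2\nu)\alpha$ suggests, that the worst case in Part 2 is $\delta = (1-2\nu)\alpha$; the function names and the sample value of $\alpha$ are illustrative only.

```python
from fractions import Fraction

def part1_lower(alpha, nu):
    # alpha - nu*alpha/(2 - nu): the lower bound reached in Part 1
    return alpha - nu * alpha / (2 - nu)

def part2_ratio(alpha, nu):
    # Part 2 ratio with the assumed worst case delta = (1 - 2 nu) alpha
    delta = (1 - 2 * nu) * alpha
    beta_chi = nu * alpha / (2 - nu)  # upper bound on beta * chi from (B.6)
    return (delta + beta_chi) / (alpha - beta_chi)

alpha = Fraction(3, 2)                          # arbitrary positive value
nus = [Fraction(k, 20) for k in range(1, 11)]   # grid over nu in (0, 1/2]
checks = [(nu, part1_lower(alpha, nu), part2_ratio(alpha, nu)) for nu in nus]
```

For every $\nu$ on the grid, the first quantity equals $2\alpha(1-\nu)/(2-\nu)$ and is at least $\alpha/2$, while the second equals $1-\nu$ exactly.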


■  B.4  Proof  of  Main  Result 


Here we prove Theorem 4.4.1. Throughout this section we denote $m = \max\{1, \frac{1}{\gamma}\}$. Further $\Omega = \Omega(K_O^*)$ and $T = T(K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*)$ denote the tangent spaces at the "true" sparse matrix $S^* = K_O^*$ and low-rank matrix $L^* = K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$. We assume that

$$\gamma \in \left[ \frac{3\beta(2-\nu)\xi(T)}{\nu\alpha},\ \frac{\nu\alpha}{2\beta(2-\nu)\mu(\Omega)} \right]. \tag{B.7}$$

We also let $E_n = \Sigma_O^n - \Sigma_O^*$ denote the difference between the true marginal covariance and the sample covariance. Finally we let $D = \max\{1, \frac{\nu\alpha}{3\beta(2-\nu)}\}$ throughout this section. For $\gamma$ in the above range we note that

$$m \le \frac{D}{\xi(T)}. \tag{B.8}$$

Standard facts that we use throughout this section are that $\xi(T) \le 1$ and that $\|M\|_\infty \le \|M\|_2$ for any matrix $M$.
We study the following convex program:

$$(\hat{S}_n, \hat{L}_n) = \arg\min_{S,L}\ \mathrm{Tr}[(S - L)\, \Sigma_O^n] - \log\det(S - L) + \lambda_n [\gamma \|S\|_1 + \|L\|_*] \quad \text{s.t. } S - L \succ 0. \tag{B.9}$$

Comparing (B.9) with the convex program (4.9), the main difference is that we do not constrain the variable $L$ to be positive semidefinite in (B.9) (recall that the nuclear norm of a positive semidefinite matrix is equal to its trace). However we show that the unique optimum $(\hat{S}_n, \hat{L}_n)$ of (B.9) under the hypotheses of Theorem 4.4.1 is such that $\hat{L}_n \succeq 0$ (with high probability). Therefore we conclude that $(\hat{S}_n, \hat{L}_n)$ is also the unique optimum of (4.9). The subdifferential with respect to the nuclear norm at a matrix $M$ with (reduced) SVD given by $M = U D V^T$ is as follows:

$$N \in \partial\|M\|_* \iff \mathcal{P}_{T(M)}(N) = U V^T,\quad \|\mathcal{P}_{T(M)^\perp}(N)\|_2 \le 1.$$

The proof of this theorem consists of a number of steps, each of which is analyzed in separate sections below. We explicitly keep track of the constants $\alpha, \beta, \nu, \psi$. The key ideas are as follows:


1. We show that if we solve the convex program (B.9) subject to the additional constraints that $S \in \Omega$ and $L \in T'$ for some $T'$ "close to" $T$ (measured by $\rho(T', T)$), then the error between the optimal solution $(\hat{S}_n, \hat{L}_n)$ and the underlying matrices $(S^*, L^*)$ is small. This result is discussed in Appendix B.4.2.

2.  We  analyze  the  optimization  problem  (B.9)  with  the  additional  constraint  that 
the  variables  S  and  L  belong  to  the  algebraic  varieties  of  sparse  and  low-rank 
matrices  respectively,  and  that  the  corresponding  tangent  spaces  are  close  to  the 
tangent  spaces  at  ( S*,L *).  We  show  that  under  suitable  conditions  on  the  min¬ 
imum  nonzero  singular  value  of  the  true  low-rank  matrix  L*  and  on  the  mini¬ 
mum  magnitude  nonzero  entry  of  the  true  sparse  matrix  S* ,  the  optimum  of  this 
modified  program  is  achieved  at  a  smooth  point  of  the  underlying  varieties.  In 
particular  the  bound  on  the  minimum  nonzero  singular  value  of  L*  helps  bound 
the  curvature  of  the  low-rank  matrix  variety  locally  around  L*  (we  use  the  results 
described  in  Appendix  B.2).  Further  we  also  show  that  the  tangent-spaces  at  the 
solution  to  this  variety-constrained  problem  are  close  to  the  tangent  spaces  at  the 
true  underlying  matrices  (S* ,  L*).  These  results  are  described  in  Appendix  B.4.3. 
3.  The  next  step  is  to  show  that  the  variety  constraint  can  be  linearized  and  changed 
to  a  tangent-space  constraint  (see  Appendix  B.4.4),  thus  giving  us  a  convex  pro¬ 
gram.  Under  suitable  conditions  this  tangent-space  constrained  program  also  has 
an  optimum  that  has  the  same  support /rank  as  the  true  (S*,L*).  Based  on  the 
previous  step  these  tangent  spaces  in  the  constraints  are  close  to  the  tangent 
spaces  at  the  true  ( S*,L *).  Therefore  we  use  the  first  step  to  conclude  that  the 
resulting  error  in  the  estimate  is  small. 

4.  Finally  we  show  that  under  the  identifiability  conditions  of  Section  4.3  these 
tangent-space  constraints  are  inactive  at  the  optimum  (see  Appendix  B.4.7). 
Therefore  we  conclude  with  the  statement  that  the  optimum  of  the  convex  pro¬ 
gram  (B.9)  without  any  variety  constraints  is  achieved  at  a  pair  of  matrices  that 
have  the  same  support /rank  as  the  true  (S* ,  L*)  (with  high  probability).  Further 
the  low-rank  component  of  the  solution  is  positive  semidefinite,  thus  allowing  us 
to  conclude  that  the  original  convex  program  (4.9)  also  provides  estimates  that 
are  consistent. 


■  B.4.1  Bounded  curvature  of  matrix  inverse 

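Consider the Taylor series of the inverse of a matrix:

$$(M + \Delta)^{-1} = M^{-1} - M^{-1} \Delta M^{-1} + R_{M^{-1}}(\Delta),$$

where

$$R_{M^{-1}}(\Delta) = M^{-1} \left[ \sum_{k=2}^{\infty} (-\Delta M^{-1})^k \right].$$

This infinite sum converges for $\Delta$ sufficiently small. The following proposition provides a bound on the second-order term specialized to our setting:

Proposition B.4.1. Suppose that $\gamma$ is in the range given by (B.7). Let $(\Delta_S, \Delta_L)$ be any pair with $\Delta_S \in \Omega$ and $g_\gamma(\Delta_S, \Delta_L) \le \frac{1}{2C_1}$ for $C_1 = \psi(1 + \frac{\alpha}{6\beta})$. Then we have that

$$g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L))) \le \frac{2 D \psi C_1^2\, g_\gamma(\Delta_S, \Delta_L)^2}{\xi(T)}.$$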
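Proof: We have that

$$\begin{aligned}
\|\mathcal{A}(\Delta_S, \Delta_L)\|_2 &\le \|\Delta_S\|_2 + \|\Delta_L\|_2 \\
&\le \gamma\mu(\Omega) \frac{\|\Delta_S\|_\infty}{\gamma} + \|\Delta_L\|_2 \\
&\le (1 + \gamma\mu(\Omega))\, g_\gamma(\Delta_S, \Delta_L) \\
&\le \left(1 + \frac{\alpha}{6\beta}\right) g_\gamma(\Delta_S, \Delta_L) \\
&\le \frac{1}{2\psi},
\end{aligned}$$

where the second-to-last inequality follows from the range for $\gamma$ (B.7), and the final inequality follows from the bound on $g_\gamma(\Delta_S, \Delta_L)$. Therefore,

$$\begin{aligned}
\|R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L))\|_2 &\le \psi \sum_{k=2}^{\infty} (\|\Delta_S + \Delta_L\|_2\, \psi)^k \\
&\le \psi^3 \|\Delta_S + \Delta_L\|_2^2\, \frac{1}{1 - \|\Delta_S + \Delta_L\|_2\, \psi} \\
&\le 2\psi^3 \left(1 + \frac{\alpha}{6\beta}\right)^2 g_\gamma(\Delta_S, \Delta_L)^2 \\
&= 2\psi C_1^2\, g_\gamma(\Delta_S, \Delta_L)^2.
\end{aligned}$$

Here we apply the last two inequalities from above. Since the $\|\cdot\|_\infty$-norm is bounded above by the spectral norm $\|\cdot\|_2$ and $m \le \frac{D}{\xi(T)}$ by (B.8), we have the desired result. □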
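The expansion of the matrix inverse and the geometric-series bound on its remainder can be checked on a small example. This is a minimal sketch for symmetric $2 \times 2$ matrices, assuming the remainder has the form $R_{M^{-1}}(\Delta) = M^{-1}\sum_{k \ge 2}(-\Delta M^{-1})^k$ with $\psi = \|M^{-1}\|_2$; the matrices `M` and `Delta` and all helper names are made up for illustration.

```python
import math

def inv2(A):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def lin(A, B, s):
    # entrywise A + s * B
    return [[A[i][j] + s * B[i][j] for j in range(2)] for i in range(2)]

def spec_norm_sym(A):
    # spectral norm of a symmetric 2x2 matrix: largest |eigenvalue|
    a, b, d = A[0][0], A[0][1], A[1][1]
    mean, rad = (a + d) / 2, math.hypot((a - d) / 2, b)
    return max(abs(mean + rad), abs(mean - rad))

ZERO = [[0.0, 0.0], [0.0, 0.0]]
M = [[2.0, 0.5], [0.5, 1.0]]
Delta = [[0.05, -0.02], [-0.02, 0.03]]
Minv = inv2(M)

# Remainder from the expansion: R = (M + Delta)^{-1} - M^{-1} + M^{-1} Delta M^{-1}
R_exact = lin(lin(inv2(lin(M, Delta, 1.0)), Minv, -1.0),
              mul(mul(Minv, Delta), Minv), 1.0)

# Same remainder from the (truncated) series M^{-1} sum_{k>=2} (-Delta M^{-1})^k
X = lin(ZERO, mul(Delta, Minv), -1.0)
power, total = mul(X, X), ZERO
for _ in range(60):
    total = lin(total, power, 1.0)
    power = mul(power, X)
R_series = mul(Minv, total)

psi = spec_norm_sym(Minv)
d = spec_norm_sym(Delta)
bound = psi**3 * d**2 / (1 - psi * d)  # geometric-series bound on ||R||_2
```

Here `R_exact` and `R_series` agree to machine precision, and the spectral norm of the remainder stays below `bound`, mirroring the second inequality in the proof.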


■  B.4.2  Bounded  errors 


Next we analyze the following convex program subject to certain additional tangent-space constraints:

$$(\hat{S}_\Omega, \hat{L}_{T'}) = \arg\min_{S,L}\ \mathrm{Tr}[(S - L)\, \Sigma_O^n] - \log\det(S - L) + \lambda_n [\gamma \|S\|_1 + \|L\|_*] \quad \text{s.t. } S - L \succ 0,\ S \in \Omega,\ L \in T', \tag{B.10}$$

for some subspace $T'$. We show that if $T'$ is any tangent space to the low-rank matrix variety such that $\rho(T, T') \le \frac{\xi(T)}{2}$ then we can bound the error $(\Delta_S, \Delta_L) = (\hat{S}_\Omega - S^*,\ L^* - \hat{L}_{T'})$. Let $C_{T'} = \mathcal{P}_{T'^\perp}(L^*)$ denote the orthogonal component of the true low-rank matrix, and recall that $E_n = \Sigma_O^n - \Sigma_O^*$ denotes the difference between the true marginal covariance and the sample covariance. The proof of the following result uses Brouwer's fixed-point theorem [113], and is inspired by the proof of a similar result in [119] for standard sparse graphical model recovery without latent variables.
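Proposition B.4.2. Let the error $(\Delta_S, \Delta_L)$ in the solution of the convex program (B.10) (with $T'$ such that $\rho(T', T) \le \frac{\xi(T)}{2}$) be as defined above. Further let $C_1 = \psi(1 + \frac{\alpha}{6\beta})$, and define

$$r = \max\left\{ \frac{8}{\alpha} \left[ g_\gamma(\mathcal{A}^\dagger E_n) + g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T'}) + \lambda_n \right],\ \|C_{T'}\|_2 \right\}.$$

If we have that

$$r \le \min\left\{ \frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64 D \psi C_1^2} \right\},$$

for $\gamma$ in the range given by (B.7), then

$$g_\gamma(\Delta_S, \Delta_L) \le 2r.$$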

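Proof: Based on Proposition 4.3.1 we note that the convex program (B.10) is strictly convex (because the negative log-likelihood term has a strictly positive-definite Hessian due to the constraints involving transverse tangent spaces), and therefore the optimum is unique. Applying the optimality conditions of the convex program (B.10) at the optimum $(\hat{S}_\Omega, \hat{L}_{T'})$, we have that there exist Lagrange multipliers $Q_{\Omega^\perp} \in \Omega^\perp$, $Q_{T'^\perp} \in T'^\perp$ such that

$$\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1} + Q_{\Omega^\perp} \in -\lambda_n \gamma\, \partial\|\hat{S}_\Omega\|_1, \qquad \Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1} + Q_{T'^\perp} \in \lambda_n\, \partial\|\hat{L}_{T'}\|_*.$$

Restricting these conditions to the space $\mathcal{Y} = \Omega \times T'$, one can check that

$$\mathcal{P}_\Omega[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1}] = Z_\Omega, \qquad \mathcal{P}_{T'}[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1}] = Z_{T'},$$

where $Z_\Omega \in \Omega$, $Z_{T'} \in T'$ and $\|Z_\Omega\|_\infty = \lambda_n \gamma$, $\|Z_{T'}\|_2 \le 2\lambda_n$ (we use here the fact that projecting onto a tangent space $T'$ increases the spectral norm by at most a factor of two). Denoting $Z = [Z_\Omega, Z_{T'}]$, we conclude that

$$\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger [\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1}] = Z, \tag{B.11}$$

with $g_\gamma(Z) \le 2\lambda_n$. Since the optimum $(\hat{S}_\Omega, \hat{L}_{T'})$ is unique, one can check using Lagrangian duality theory [124] that $(\hat{S}_\Omega, \hat{L}_{T'})$ is the unique solution of the equation (B.11). Rewriting $\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1}$ in terms of the errors $(\Delta_S, \Delta_L)$, we have using the Taylor series of the matrix inverse that

$$\begin{aligned}
\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T'})^{-1} &= \Sigma_O^n - [\mathcal{A}(\Delta_S, \Delta_L) + (\Sigma_O^*)^{-1}]^{-1} \\
&= E_n - R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) + \mathbb{I}^* \mathcal{A}(\Delta_S, \Delta_L) \\
&= E_n - R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) + \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L) + \mathbb{I}^* C_{T'}.
\end{aligned} \tag{B.12}$$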
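Since $T'$ is a tangent space such that $\rho(T', T) \le \frac{\xi(T)}{2}$, we have from Proposition 4.3.1 that the operator $\mathcal{B} = (\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y})^{-1}$ from $\mathcal{Y}$ to $\mathcal{Y}$ is bijective and is well-defined. Now consider the following matrix-valued function from $(\delta_S, \delta_L) \in \mathcal{Y}$ to $\mathcal{Y}$:

$$F(\delta_S, \delta_L) = (\delta_S, \delta_L) - \mathcal{B}\left\{ \mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger [E_n - R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) + \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\delta_S, \delta_L) + \mathbb{I}^* C_{T'}] - Z \right\}.$$

A point $(\delta_S, \delta_L) \in \mathcal{Y}$ is a fixed-point of $F$ if and only if $\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger [E_n - R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) + \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\delta_S, \delta_L) + \mathbb{I}^* C_{T'}] = Z$. Applying equations (B.11) and (B.12) above, we then see that the only fixed-point of $F$ by construction is the "true" error $\mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)$ restricted to $\mathcal{Y}$. The reason for this is that, as discussed above, $(\hat{S}_\Omega, \hat{L}_{T'})$ is the unique optimum of (B.10) and therefore is the unique solution of (B.11). Next we show that this unique fixed-point of $F$ lies in the ball $\mathbb{B}_r = \{(\delta_S, \delta_L) \mid g_\gamma(\delta_S, \delta_L) \le r,\ (\delta_S, \delta_L) \in \mathcal{Y}\}$.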

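In order to prove this step, we resort to Brouwer's fixed point theorem [113]. In particular we show that the function $F$ maps the ball $\mathbb{B}_r$ into itself. Since $F$ is a continuous function and $\mathbb{B}_r$ is a compact convex set, we can conclude the proof of this proposition. Simplifying the function $F$, we have that

$$F(\delta_S, \delta_L) = \mathcal{B}\left\{ \mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger [-E_n + R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) - \mathbb{I}^* C_{T'}] + Z \right\}.$$

Consequently, we have from Proposition 4.3.1 that

$$\begin{aligned}
g_\gamma(F(\delta_S, \delta_L)) &\le \frac{2}{\alpha}\, g_\gamma\left( \mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger [E_n - R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) + \mathbb{I}^* C_{T'}] - Z \right) \\
&\le \frac{4}{\alpha} \left\{ g_\gamma(\mathcal{A}^\dagger [E_n - R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})) + \mathbb{I}^* C_{T'}]) + \lambda_n \right\} \\
&\le \frac{r}{2} + \frac{4}{\alpha}\, g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'}))),
\end{aligned}$$

where in the second inequality we use the fact that $g_\gamma(\mathcal{P}_\mathcal{Y}(\cdot,\cdot)) \le 2 g_\gamma(\cdot,\cdot)$ and that $g_\gamma(Z) \le 2\lambda_n$, and in the final inequality we use the assumption on $r$.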

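We now focus on the term $g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'})))$:

$$\begin{aligned}
\frac{4}{\alpha}\, g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\delta_S, \delta_L + C_{T'}))) &\le \frac{8 D \psi C_1^2 (g_\gamma(\delta_S, \delta_L) + \|C_{T'}\|_2)^2}{\xi(T)\alpha} \\
&\le \frac{32 D \psi C_1^2 r^2}{\xi(T)\alpha} \\
&\le \frac{32 D \psi C_1^2 r}{\xi(T)\alpha} \cdot \frac{\alpha\xi(T)}{64 D \psi C_1^2} \\
&\le \frac{r}{2},
\end{aligned}$$

where we have used both bounds assumed on $r$: the bound $r \le \frac{1}{4C_1}$ ensures that Proposition B.4.1 applies in the first inequality, and the bound $r \le \frac{\alpha\xi(T)}{64 D \psi C_1^2}$ gives the third inequality. Hence $g_\gamma(\mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) \le r$ by Brouwer's fixed-point theorem. Finally we observe that

$$g_\gamma(\Delta_S, \Delta_L) \le g_\gamma(\mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) + \|C_{T'}\|_2 \le 2r.$$

□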


■  B.4.3  Solving  a  variety-constrained  problem 

In order to prove that the solution $(\hat{S}_n, \hat{L}_n)$ of (B.9) has the same sparsity pattern/rank as $(S^*, L^*)$, we will study an optimization problem that explicitly enforces these constraints. Specifically, we consider the following non-convex constraint set:

$$\mathcal{M} = \Big\{ (S,L) \ \Big|\ S \in \Omega(S^*),\ \mathrm{rank}(L) \le \mathrm{rank}(L^*),\ \|\mathcal{P}_{T^\perp}(L - L^*)\|_2 \le \frac{\xi(T)\lambda_n}{D\psi^2},\ g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* \mathcal{A}(S - S^*, L^* - L)) \le 11\lambda_n \Big\}.$$

Recall that $S^* = K_O^*$ and $L^* = K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$. The first constraint ensures that the tangent space at $S$ is the same as the tangent space at $S^*$; therefore the support of $S$ is contained in the support of $S^*$. The second and third constraints ensure that $L$ lives in the appropriate low-rank variety, but has a tangent space "close" to the tangent space $T$. The final constraint roughly bounds the sum of the errors $(S - S^*) + (L^* - L)$; note that this does not necessarily bound the individual errors. Notice that the only non-convex constraint is that $\mathrm{rank}(L) \le \mathrm{rank}(L^*)$. We then have the following nonlinear program:

$$(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M}) = \arg\min_{S,L}\ \mathrm{Tr}[(S - L)\, \Sigma_O^n] - \log\det(S - L) + \lambda_n [\gamma \|S\|_1 + \|L\|_*] \quad \text{s.t. } S - L \succ 0,\ (S,L) \in \mathcal{M}. \tag{B.13}$$

Under suitable conditions this nonlinear program is shown to have a unique solution. Each of the constraints in $\mathcal{M}$ is useful for proving the consistency of the solution of the convex program (B.9). We show that under suitable conditions the constraints in $\mathcal{M}$ are actually inactive at the optimal $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$, thus allowing us to conclude that the solution of (B.9) is also equal to $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$; hence the solution of (B.9) shares the consistency properties of $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$. A number of interesting properties can be derived simply by studying the constraint set $\mathcal{M}$.
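Proposition B.4.3. Consider any $(S,L) \in \mathcal{M}$, and let $\Delta_S = S - S^*$, $\Delta_L = L^* - L$. For $\gamma$ in the range specified by (B.7) and letting $C_2 = \frac{48}{\alpha} + \frac{1}{\psi^2}$, we have that $g_\gamma(\Delta_S, \Delta_L) \le C_2 \lambda_n$.

Proof: We have by the triangle inequality that

$$\begin{aligned}
g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* \mathcal{A}(\mathcal{P}_\Omega(\Delta_S), \mathcal{P}_T(\Delta_L))) &\le 11\lambda_n + g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* \mathcal{A}(\mathcal{P}_{\Omega^\perp}(\Delta_S), \mathcal{P}_{T^\perp}(\Delta_L))) \\
&\le 11\lambda_n + m\psi^2 \|\mathcal{P}_{T^\perp}(\Delta_L)\|_2 \\
&\le 12\lambda_n,
\end{aligned}$$

as $m \le \frac{D}{\xi(T)}$. Therefore, we have that $g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) \le 24\lambda_n$, where $\mathcal{Y} = \Omega \times T$. Consequently, we can apply Proposition 4.3.1 to conclude that

$$g_\gamma(\mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) \le \frac{48\lambda_n}{\alpha}.$$

Finally, we use the triangle inequality again to conclude that

$$\begin{aligned}
g_\gamma(\Delta_S, \Delta_L) &\le g_\gamma(\mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) + g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp}(\Delta_S, \Delta_L)) \\
&\le \frac{48\lambda_n}{\alpha} + m \|\mathcal{P}_{T^\perp}(\Delta_L)\|_2 \\
&\le C_2 \lambda_n.
\end{aligned}$$

□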

This simple result immediately leads to a number of useful corollaries. For example we have that under a suitable bound on the minimum nonzero singular value of $L^* = K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$, the constraint in $\mathcal{M}$ along the normal direction $T^\perp$ is locally inactive. Next we list several useful consequences of Proposition B.4.3.

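Corollary B.4.1. Consider any $(S,L) \in \mathcal{M}$, and let $\Delta_S = S - S^*$, $\Delta_L = L^* - L$. Suppose $\gamma$ is in the range specified by (B.7), and let $C_3 = \left( \frac{6(2-\nu)}{\nu} + 1 \right) C_2^2 \psi^2 D$ and $C_4 = C_2 + \frac{3\alpha C_2^2 (2-\nu)}{16(3-\nu)}$ (where $C_2$ is as defined in Proposition B.4.3). Let the minimum nonzero singular value $\sigma$ of $L^* = K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ be such that $\sigma \ge \frac{C_5 \lambda_n}{\xi(T)^2}$ for $C_5 = \max\{C_3, C_4\}$, and suppose that the smallest magnitude nonzero entry of $S^*$ is greater than $\frac{C_6 \lambda_n}{\mu(\Omega)}$ for $C_6 = \frac{C_2 \nu\alpha}{\beta(2-\nu)}$. Setting $T' = T(L)$ and $C_{T'} = \mathcal{P}_{T'^\perp}(L^*)$, we then have that:

1. $L$ has rank equal to $\mathrm{rank}(L^*)$, i.e., $L$ is a smooth point of the variety of matrices with rank less than or equal to $\mathrm{rank}(L^*)$. In particular $L$ has the same inertia as $L^*$.

2. $\|\mathcal{P}_{T^\perp}(\Delta_L)\|_2 \le \frac{\xi(T)\lambda_n}{19 D \psi^2}$.

3. $\rho(T, T') \le \frac{\xi(T)}{4}$.

4. $g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T'}) \le \frac{\lambda_n \nu}{6(2-\nu)}$.

5. $\|C_{T'}\|_2 \le \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)}$.

6. $\mathrm{sign}(S) = \mathrm{sign}(S^*)$.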

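Proof: We note the following facts before proving each step. First $C_2 \ge \frac{1}{\psi^2} \ge \frac{1}{m\psi^2} \ge \frac{\xi(T)}{D\psi^2}$. Second $\xi(T) \le 1$. Third we have from Proposition B.4.3 that $\|\Delta_L\|_2 \le C_2 \lambda_n$. Finally $\frac{6(2-\nu)}{\nu} \ge 18$ for $\nu \in (0, \frac{1}{2}]$. We prove each step separately.

For the first step, we note that

$$\sigma \ge \frac{C_3 \lambda_n}{\xi(T)^2} \ge \frac{19 C_2^2 \psi^2 D \lambda_n}{\xi(T)^2} \ge \frac{19 C_2 \lambda_n}{\xi(T)} \ge 8 C_2 \lambda_n \ge 8\|\Delta_L\|_2.$$

Hence $L$ is a smooth point with rank equal to $\mathrm{rank}(L^*)$, and specifically has the same inertia as $L^*$.

For the second step, we use the fact that $\sigma \ge 8\|\Delta_L\|_2$ to apply Proposition 4.2.2:

$$\|\mathcal{P}_{T^\perp}(\Delta_L)\|_2 \le \frac{\|\Delta_L\|_2^2}{\sigma} \le \frac{C_2^2 \lambda_n^2\, \xi(T)^2}{C_3 \lambda_n} \le \frac{\xi(T)\lambda_n}{19 D \psi^2}.$$

For the third step we apply Proposition 4.2.1 (by using the conclusion from above that $\sigma \ge 8\|\Delta_L\|_2$) so that

$$\rho(T, T') \le \frac{2\|\Delta_L\|_2}{\sigma} \le \frac{2 C_2\, \xi(T)^2}{C_3} \le \frac{2\xi(T)^2}{19} \le \frac{\xi(T)}{4}.$$

For the fourth step let $\sigma'$ denote the minimum singular value of $L$. Consequently,

$$\sigma' \ge \frac{C_3 \lambda_n}{\xi(T)^2} - C_2 \lambda_n \ge C_2 \lambda_n \left[ \frac{19 C_2 D \psi^2}{\xi(T)^2} - 1 \right] \ge 8\|\Delta_L\|_2.$$

Using the same reasoning as in the proof of the second step, we have that

$$\|C_{T'}\|_2 \le \frac{\|\Delta_L\|_2^2}{\sigma'} \le \frac{C_2^2 \lambda_n^2}{\left( \frac{C_3}{\xi(T)^2} - C_2 \right) \lambda_n} \le \frac{C_2^2\, \xi(T)^2 \lambda_n}{C_2^2 \psi^2 D\, \frac{6(2-\nu)}{\nu}} \le \frac{\nu\xi(T)\lambda_n}{6(2-\nu) D \psi^2}.$$

Hence

$$g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T'}) \le m\psi^2 \|C_{T'}\|_2 \le \frac{\lambda_n \nu}{6(2-\nu)}.$$

For the fifth step the bound on $\sigma'$ implies that

$$\sigma' \ge \frac{C_4 \lambda_n}{\xi(T)^2} - C_2 \lambda_n \ge \frac{3 C_2^2 \alpha (2-\nu)}{16(3-\nu)} \lambda_n.$$

Since $\sigma' \ge 8\|\Delta_L\|_2$, we have from Proposition 4.2.2 and some algebra that

$$\|C_{T'}\|_2 \le \frac{C_2^2 \lambda_n^2}{\sigma'} \le \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)}.$$

For the final step since $\|\Delta_S\|_\infty \le \gamma C_2 \lambda_n$, the assumed lower bound on the minimum magnitude nonzero entry of $S^*$ guarantees that $\mathrm{sign}(S) = \mathrm{sign}(S^*)$. □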

Notice that this corollary applies to any $(S,L) \in \mathcal{M}$, and is hence applicable to any solution $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$ of the $\mathcal{M}$-constrained program (B.13). For now we choose an arbitrary solution $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$ and proceed. In the next steps we show that $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$ is the unique solution to the convex program (B.9), thus showing that $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$ is also the unique solution to (B.13).


■  B.4.4  From  variety  constraint  to  tangent-space  constraint 

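Given the solution $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$, we show that the solution to the convex program (B.10) with the tangent space constraint $L \in T_\mathcal{M} \triangleq T(\hat{L}_\mathcal{M})$ is the same as $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$ under suitable conditions:

$$(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}}) = \arg\min_{S,L}\ \mathrm{Tr}[(S - L)\, \Sigma_O^n] - \log\det(S - L) + \lambda_n [\gamma \|S\|_1 + \|L\|_*] \quad \text{s.t. } S - L \succ 0,\ S \in \Omega,\ L \in T_\mathcal{M}. \tag{B.14}$$

Assuming the bound of Corollary B.4.1 on the minimum singular value of $L^*$, the uniqueness of the solution $(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}})$ is assured. This is because we have from Proposition 4.3.1 and from Corollary B.4.1 that $\mathbb{I}^*$ is injective on $\Omega \oplus T_\mathcal{M}$. Therefore the Hessian of the convex objective function of (B.14) is strictly positive-definite at $(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}})$.

We let $C_{T_\mathcal{M}} = \mathcal{P}_{T_\mathcal{M}^\perp}(L^*)$. Recall that $E_n = \Sigma_O^n - \Sigma_O^*$ denotes the difference between the sample covariance matrix and the marginal covariance matrix of the observed variables.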


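Proposition B.4.4. Let $\gamma$ be in the range specified by (B.7). Suppose that the minimum nonzero singular value $\sigma$ of $L^* = K_{O,H}^* (K_H^*)^{-1} K_{H,O}^*$ is such that $\sigma \ge \frac{C_5 \lambda_n}{\xi(T)^2}$ ($C_5$ is defined in Corollary B.4.1). Suppose also that the minimum magnitude nonzero entry of $S^*$ is greater than or equal to $\frac{C_6 \lambda_n}{\mu(\Omega)}$ ($C_6$ is defined in Corollary B.4.1). Let $g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{\lambda_n \nu}{6(2-\nu)}$. Further suppose that

$$\lambda_n \le \frac{3\alpha(2-\nu)}{16(3-\nu)} \min\left\{ \frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64 D \psi C_1^2} \right\}.$$

Then we have that

$$(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}}) = (\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M}).$$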

Proof: Note first that the condition on the minimum singular value of $L^*$ in Corollary B.4.1 is satisfied. Therefore we proceed with the following two steps:

1. First we can change the non-convex constraint $\mathrm{rank}(L) \le \mathrm{rank}(L^*)$ to the linear constraint $L \in T(\hat{L}_\mathcal{M})$. This is because the lower bound assumed for $\sigma$ implies that $\hat{L}_\mathcal{M}$ is a smooth point of the algebraic variety of matrices with rank less than or equal to $\mathrm{rank}(L^*)$ (from Corollary B.4.1). Due to the convexity of all the other constraints and the objective, the optimum of this "linearized" convex program will still be $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$.

2. Next we can again apply Corollary B.4.1 (based on the bound on $\sigma$) to conclude that the constraint $\|\mathcal{P}_{T^\perp}(L - L^*)\|_2 \le \frac{\xi(T)\lambda_n}{D\psi^2}$ is locally inactive at the point $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$.

Consequently, we have that $(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M})$ can be written as the solution of a convex program:

$$\begin{aligned}
(\hat{S}_\mathcal{M}, \hat{L}_\mathcal{M}) = \arg\min_{S,L}\ & \mathrm{Tr}[(S - L)\, \Sigma_O^n] - \log\det(S - L) + \lambda_n [\gamma \|S\|_1 + \|L\|_*] \\
\text{s.t. }\ & S - L \succ 0,\ S \in \Omega,\ L \in T_\mathcal{M}, \\
& g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* \mathcal{A}(S - S^*, L^* - L)) \le 11\lambda_n.
\end{aligned} \tag{B.15}$$

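We now need to argue that the constraint $g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* \mathcal{A}(S - S^*, L^* - L)) \le 11\lambda_n$ is also inactive in the convex program (B.15). We proceed by showing that the solution $(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}})$ of the convex program (B.14) has the property that $g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* \mathcal{A}(\hat{S}_\Omega - S^*, L^* - \hat{L}_{T_\mathcal{M}})) < 11\lambda_n$, which concludes the proof of this proposition. We have from Corollary B.4.1 that $g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T_\mathcal{M}}) \le \frac{\lambda_n \nu}{6(2-\nu)}$. Since $g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{\lambda_n \nu}{6(2-\nu)}$ by assumption, one can verify that

$$\begin{aligned}
\frac{8}{\alpha} \left[ \lambda_n + g_\gamma(\mathcal{A}^\dagger E_n) + g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T_\mathcal{M}}) \right] &\le \frac{8\lambda_n}{\alpha} \left[ 1 + \frac{\nu}{3(2-\nu)} \right] \\
&= \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)} \\
&\le \min\left\{ \frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64 D \psi C_1^2} \right\}.
\end{aligned}$$

The last line follows from the assumption on $\lambda_n$. We also note that $\|C_{T_\mathcal{M}}\|_2 \le \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)}$ from Corollary B.4.1, which implies that $\|C_{T_\mathcal{M}}\|_2 \le \min\left\{ \frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64 D \psi C_1^2} \right\}$. Letting $(\Delta_S, \Delta_L) = (\hat{S}_\Omega - S^*, L^* - \hat{L}_{T_\mathcal{M}})$, we can conclude from Proposition B.4.2 that $g_\gamma(\Delta_S, \Delta_L) \le \frac{32(3-\nu)\lambda_n}{3\alpha(2-\nu)}$. Next we apply Proposition B.4.1 (as $g_\gamma(\Delta_S, \Delta_L) \le \frac{1}{2C_1}$) to conclude that

$$\begin{aligned}
g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\Delta_S + \Delta_L)) &\le \frac{2 D \psi C_1^2}{\xi(T)} \cdot \frac{32(3-\nu)\lambda_n}{3\alpha(2-\nu)} \cdot \frac{\alpha\xi(T)}{32 D \psi C_1^2} \\
&= \frac{2(3-\nu)\lambda_n}{3(2-\nu)}.
\end{aligned} \tag{B.16}$$

From the optimality conditions of (B.14) one can also check that

$$\begin{aligned}
g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) &\le 2\lambda_n + g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger R_{\Sigma_O^*}(\Delta_S + \Delta_L)) + g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* C_{T_\mathcal{M}}) + g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger E_n) \\
&\le 2\left[ \lambda_n + g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\Delta_S + \Delta_L)) + g_\gamma(\mathcal{A}^\dagger E_n) + g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T_\mathcal{M}}) \right] \\
&\le 4\left[ \frac{2(3-\nu)\lambda_n}{3(2-\nu)} \right].
\end{aligned}$$

Here we used (B.16) in the last inequality, and also that $g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T_\mathcal{M}}) \le \frac{\lambda_n \nu}{6(2-\nu)}$ (as noted above from Corollary B.4.1) and that $g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{\lambda_n \nu}{6(2-\nu)}$. Therefore,

$$g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) \le \frac{16\lambda_n}{3}, \tag{B.17}$$

because $\nu \in (0, \frac{1}{2}]$. Based on Proposition 4.3.1 (the second part), we also have that

$$g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) \le (1 - \nu) \frac{16\lambda_n}{3} \le \frac{16\lambda_n}{3}. \tag{B.18}$$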
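Summarizing steps (B.17) and (B.18),

$$\begin{aligned}
g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* \mathcal{A}(\Delta_S, \Delta_L)) &\le g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) + g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) + g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T_\mathcal{M}}) \\
&\le \frac{16\lambda_n}{3} + \frac{16\lambda_n}{3} + \frac{\lambda_n \nu}{6(2-\nu)} \\
&\le \frac{32\lambda_n}{3} + \frac{\lambda_n}{18} \\
&< 11\lambda_n.
\end{aligned}$$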
This  concludes  the  proof  of  the  proposition.  □ 
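The numerical constants in the last steps of this proof can be verified exactly over the admissible range of $\nu$. The following is a minimal sketch, assuming the intermediate bound takes the form $4 \cdot \frac{2(3-\nu)}{3(2-\nu)}\lambda_n$ and working in units of $\lambda_n$; the function names are illustrative.

```python
from fractions import Fraction

def projected_bound(nu):
    # 4 * 2(3 - nu) / (3 (2 - nu)), in units of lambda_n; compared with 16/3 in (B.17)
    return 4 * 2 * (3 - nu) / (3 * (2 - nu))

def total_bound(nu):
    # 16/3 + 16/3 + nu / (6 (2 - nu)), in units of lambda_n; compared with 11
    return 2 * Fraction(16, 3) + nu / (6 * (2 - nu))

nus = [Fraction(k, 100) for k in range(1, 51)]  # grid over nu in (0, 1/2]
worst_projected = max(projected_bound(nu) for nu in nus)
worst_total = max(total_bound(nu) for nu in nus)
```

Both quantities are increasing in $\nu$, so the worst case on the grid is at $\nu = \frac{1}{2}$, where they evaluate to $\frac{40}{9}$ and $\frac{193}{18}$ respectively, below $\frac{16}{3}$ and $11$.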

This  proposition  has  the  following  important  consequence. 

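Corollary B.4.2. Under the assumptions of Proposition B.4.4 we have that $\mathrm{rank}(\hat{L}_{T_\mathcal{M}}) = \mathrm{rank}(L^*)$ and that $T(\hat{L}_{T_\mathcal{M}}) = T_\mathcal{M}$. Moreover, $\hat{L}_{T_\mathcal{M}}$ actually has the same inertia as $L^*$. We also have that $\mathrm{sign}(\hat{S}_\Omega) = \mathrm{sign}(S^*)$.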

■  B.4.5  Removing  the  tangent-space  constraints 

The following lemma provides a simple set of sufficient conditions under which the optimal solution $(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}})$ of (B.14) satisfies the optimality conditions of the convex program (B.9) (without the tangent space constraints).

Lemma B.4.1. Let $(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}})$ be the solution to the tangent-space constrained convex program (B.14). Suppose that the assumptions of Proposition B.4.4 hold. If in addition we have that

$$g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L))) \le \frac{\lambda_n \nu}{6(2-\nu)},$$

then $(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}})$ is also the unique optimum of the convex program (B.9).

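Proof: Recall from Corollary B.4.2 that the tangent space at $\hat{L}_{T_\mathcal{M}}$ is equal to $T_\mathcal{M}$. Applying the optimality conditions of the convex program (B.14) at the optimum $(\hat{S}_\Omega, \hat{L}_{T_\mathcal{M}})$, we have that there exist Lagrange multipliers $Q_{\Omega^\perp} \in \Omega^\perp$, $Q_{T_\mathcal{M}^\perp} \in T_\mathcal{M}^\perp$ such that

$$\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_\mathcal{M}})^{-1} + Q_{\Omega^\perp} \in -\lambda_n \gamma\, \partial\|\hat{S}_\Omega\|_1, \qquad \Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_\mathcal{M}})^{-1} + Q_{T_\mathcal{M}^\perp} \in \lambda_n\, \partial\|\hat{L}_{T_\mathcal{M}}\|_*.$$

Restricting these conditions to the space $\mathcal{Y} = \Omega \times T_\mathcal{M}$, one can check that

$$\mathcal{P}_\Omega[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_\mathcal{M}})^{-1}] = -\lambda_n \gamma\, \mathrm{sign}(S^*), \qquad \mathcal{P}_{T_\mathcal{M}}[\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_\mathcal{M}})^{-1}] = \lambda_n U V^T,$$

where $\hat{L}_{T_\mathcal{M}} = U D V^T$ is a reduced SVD of $\hat{L}_{T_\mathcal{M}}$. Denoting $Z = [-\lambda_n \gamma\, \mathrm{sign}(S^*),\ \lambda_n U V^T]$, we conclude that

$$\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger [\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_\mathcal{M}})^{-1}] = Z, \tag{B.19}$$

with $g_\gamma(Z) = \lambda_n$. It is clear that the optimality condition of the convex program (B.9) (without the tangent-space constraints) on $\mathcal{Y}$ is satisfied. All we need to show is that

$$g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger [\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_\mathcal{M}})^{-1}]) < \lambda_n. \tag{B.20}$$

Rewriting $\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_\mathcal{M}})^{-1}$ in terms of the error $(\Delta_S, \Delta_L) = (\hat{S}_\Omega - S^*, L^* - \hat{L}_{T_\mathcal{M}})$, we have that

$$\Sigma_O^n - (\hat{S}_\Omega - \hat{L}_{T_\mathcal{M}})^{-1} = E_n - R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) + \mathbb{I}^* \mathcal{A}(\Delta_S, \Delta_L).$$

Restating the condition (B.19) on $\mathcal{Y}$, we have that

$$\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L) = Z + \mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger [-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) - \mathbb{I}^* C_{T_\mathcal{M}}]. \tag{B.21}$$

(Recall that $C_{T_\mathcal{M}} = \mathcal{P}_{T_\mathcal{M}^\perp}(L^*)$.) A sufficient condition to show (B.20) and complete the proof of this lemma is that

$$g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) < \lambda_n - g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger [-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) - \mathbb{I}^* C_{T_\mathcal{M}}]).$$

We prove this inequality next. Recall from Corollary B.4.1 that $g_\gamma(\mathcal{A}^\dagger \mathbb{I}^* C_{T_\mathcal{M}}) \le \frac{\lambda_n \nu}{6(2-\nu)}$. Therefore, from equation (B.21) we can conclude that

$$\begin{aligned}
g_\gamma(\mathcal{P}_\mathcal{Y} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) &\le \lambda_n + 2\, g_\gamma(\mathcal{A}^\dagger [-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) - \mathbb{I}^* C_{T_\mathcal{M}}]) \\
&\le \lambda_n + 2\left[ \frac{3\lambda_n \nu}{6(2-\nu)} \right] \\
&= \frac{2\lambda_n}{2-\nu}.
\end{aligned}$$

Here we used the bounds assumed on $g_\gamma(\mathcal{A}^\dagger E_n)$ and on $g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)))$. Applying the second part of Proposition 4.3.1, we have that

$$\begin{aligned}
g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger \mathbb{I}^* \mathcal{A} \mathcal{P}_\mathcal{Y}(\Delta_S, \Delta_L)) &\le \frac{2(1-\nu)\lambda_n}{2-\nu} \\
&= \lambda_n - \frac{\nu\lambda_n}{2-\nu} \\
&< \lambda_n - \frac{\nu\lambda_n}{2(2-\nu)} \\
&\le \lambda_n - g_\gamma(\mathcal{A}^\dagger [-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) - \mathbb{I}^* C_{T_\mathcal{M}}]) \\
&\le \lambda_n - g_\gamma(\mathcal{P}_{\mathcal{Y}^\perp} \mathcal{A}^\dagger [-E_n + R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)) - \mathbb{I}^* C_{T_\mathcal{M}}]).
\end{aligned}$$

This concludes the proof of the lemma. □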

One  can  check  that  as  ( Sq,Ltm )  is  also  the  unique  solution  to  the  convex  program 
(B.9)  without  the  tangent-space  constraints. 


■  B.4.6  Probabilistic  analysis 

All  the  analysis  described  so  far  in  this  section  has  been  completely  deterministic  in 
nature.  Here  we  present  the  probabilistic  component  of  our  proof.  Specifically,  we 
study  the  rate  at  which  the  sample  covariance  matrix  converges  to  the  true  covariance 
matrix.  The  following  result  from  [41]  plays  a  key  role  in  our  analysis: 


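Theorem B.4.1. Given natural numbers $n, p$ with $p \le n$, let $\Gamma$ be a $p \times n$ matrix with i.i.d. Gaussian entries that have zero mean and variance $\frac{1}{n}$. Then the largest and smallest singular values $s_1(\Gamma)$ and $s_p(\Gamma)$ of $\Gamma$ are such that

$$\max\left\{ \Pr\left[ s_1(\Gamma) \ge 1 + \sqrt{\tfrac{p}{n}} + t \right],\ \Pr\left[ s_p(\Gamma) \le 1 - \sqrt{\tfrac{p}{n}} - t \right] \right\} \le \exp\left\{ -\frac{n t^2}{2} \right\},$$

for any $t > 0$.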


Using this result the next lemma provides a probabilistic bound between the sample covariance $\Sigma_O^n$ formed using $n$ samples and the true covariance $\Sigma_O^*$ in spectral norm. This result is well-known, and we mainly discuss it here for completeness and also to show explicitly the dependence on $\psi = \|\Sigma_O^*\|_2$ (4.8).

Lemma B.4.2. Let $\psi = \|\Sigma_O^*\|_2$. Given any $\delta > 0$ with $\delta \le 8\psi$, let the number of samples $n$ be such that $n \ge \frac{64 p \psi^2}{\delta^2}$. Then we have that

$$\Pr\left[ \|\Sigma_O^n - \Sigma_O^*\|_2 \ge \delta \right] \le 2\exp\left\{ -\frac{n\delta^2}{128\psi^2} \right\}.$$

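Proof: Since the spectral norm is unitarily invariant, we can assume that $\Sigma_O^*$ is diagonal without loss of generality. Let $\bar{\Sigma}^n = (\Sigma_O^*)^{-\frac{1}{2}}\, \Sigma_O^n\, (\Sigma_O^*)^{-\frac{1}{2}}$, and let $s_1(\bar{\Sigma}^n), s_p(\bar{\Sigma}^n)$ denote the largest/smallest singular values of $\bar{\Sigma}^n$. Note that $\bar{\Sigma}^n$ can be viewed as the sample covariance matrix formed from $n$ independent samples drawn from a model with identity covariance, i.e., $\bar{\Sigma}^n = \Gamma\Gamma^T$ where $\Gamma$ denotes a $p \times n$ matrix with i.i.d. Gaussian entries that have zero mean and variance $\frac{1}{n}$. We then have that

$$\begin{aligned}
\Pr\left[ \|\Sigma_O^n - \Sigma_O^*\|_2 \ge \delta \right] &\le \Pr\left[ \|\bar{\Sigma}^n - I\|_2 \ge \tfrac{\delta}{\psi} \right] \\
&\le \Pr\left[ s_1(\bar{\Sigma}^n) \ge 1 + \tfrac{\delta}{\psi} \right] + \Pr\left[ s_p(\bar{\Sigma}^n) \le 1 - \tfrac{\delta}{\psi} \right] \\
&= \Pr\left[ s_1(\Gamma)^2 \ge 1 + \tfrac{\delta}{\psi} \right] + \Pr\left[ s_p(\Gamma)^2 \le 1 - \tfrac{\delta}{\psi} \right] \\
&\le \Pr\left[ s_1(\Gamma) \ge 1 + \tfrac{\delta}{4\psi} \right] + \Pr\left[ s_p(\Gamma) \le 1 - \tfrac{\delta}{4\psi} \right] \\
&\le \Pr\left[ s_1(\Gamma) \ge 1 + \sqrt{\tfrac{p}{n}} + \tfrac{\delta}{8\psi} \right] + \Pr\left[ s_p(\Gamma) \le 1 - \sqrt{\tfrac{p}{n}} - \tfrac{\delta}{8\psi} \right] \\
&\le 2\exp\left\{ -\frac{n\delta^2}{128\psi^2} \right\}.
\end{aligned}$$

Here we used the fact that $n \ge \frac{64 p \psi^2}{\delta^2}$ in the fourth inequality, and we applied Theorem B.4.1 to obtain the final inequality by setting $t = \frac{\delta}{8\psi}$. □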

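The following corollary relates the number of samples to an error bound that holds with probability at least $1 - 2\exp\{-p\}$.

Corollary B.4.3. Let $\Sigma_O^n$ be the sample covariance formed from $n$ samples of the observed variables. Set $\delta_n = \sqrt{\frac{128 p \psi^2}{n}}$. If $n \ge 2p$, then

\[
\Pr\left[ \|\Sigma_O^n - \Sigma_O^*\|_2 \le \delta_n \right] \ge 1 - 2\exp\{-p\}.
\]

Proof: We note that $n \ge 2p$ implies that $\delta_n \le 8\psi$, and apply Lemma B.4.2 with $\delta = \delta_n$. The sample-size requirement of the lemma reads $n \ge 64 p \psi^2 / \delta_n^2 = n/2$, which holds trivially, and the exponent becomes $n\delta_n^2 / (128\psi^2) = p$. □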
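The deviation level in Corollary B.4.3 can be compared with a simulated sample covariance. The sketch below assumes the reading $\delta_n = \sqrt{128 p \psi^2 / n}$, which is the value that makes the exponent in Lemma B.4.2 equal to $p$; the diagonal covariance used here is a hypothetical example, not a model from the thesis.

```python
import numpy as np

def delta_n(p, n, psi):
    """Deviation level sqrt(128 * p * psi^2 / n)."""
    return float(np.sqrt(128.0 * p * psi**2 / n))

def covariance_deviation(sigma, n, rng):
    """Spectral-norm error of the sample covariance of n zero-mean Gaussian samples."""
    p = sigma.shape[0]
    samples = rng.multivariate_normal(np.zeros(p), sigma, size=n)  # shape (n, p)
    sample_cov = samples.T @ samples / n  # the mean is known to be zero
    return float(np.linalg.norm(sample_cov - sigma, ord=2))

rng = np.random.default_rng(0)
p, n = 10, 500
sigma = np.diag(np.linspace(1.0, 2.0, p))  # hypothetical true covariance
psi = float(np.linalg.norm(sigma, ord=2))  # spectral norm of the true covariance
err = covariance_deviation(sigma, n, rng)
```

At the boundary case $n = 2p$ the level equals $8\psi$ exactly, which is the condition $\delta \le 8\psi$ required by Lemma B.4.2.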


■  B.4.7  Putting  it  all  together 

In this section we tie together the results obtained thus far to conclude the proof of Theorem 4.4.1. We only need to show that the sufficient conditions of Lemma B.4.1 are satisfied. It follows directly from Corollary B.4.2 that the low-rank part $\hat{L}_{\mathcal{M}}$ is positive semidefinite, which implies that $(\hat{S}_\Omega, \hat{L}_{\mathcal{M}})$ is also the solution to the original regularized maximum-likelihood convex program (4.9) with the positive-semidefinite constraint. As usual set $(\Delta_S, \Delta_L) = (\hat{S}_\Omega - S^*,\ L^* - \hat{L}_{\mathcal{M}})$, and set $E_n = \Sigma_O^n - \Sigma_O^*$.

Assumptions:  We  specify  here  the  constants  that  were  suppressed  in  the  statement 
of  Theorem  4.4.1: 
Note that n > ■


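2. Set $\delta_n = \sqrt{\frac{128 p \psi^2}{n}}$, and then set $\lambda_n$ as follows:

\[
\lambda_n = \frac{6 D \delta_n (2 - \nu)}{\xi(T)\,\nu}.
\]

Note that $\lambda_n \asymp \frac{1}{\xi(T)} \sqrt{\frac{p}{n}}$.

3. Let the minimum nonzero singular value $\sigma$ of $L^*$ be such that

\[
\sigma \ge \frac{C_5 \lambda_n}{\xi(T)^2},
\]

where $C_5$ is defined in Corollary B.4.1. Note that $\sigma \gtrsim \frac{1}{\xi(T)^3} \sqrt{\frac{p}{n}}$.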

4.  Let  the  minimum  magnitude  nonzero  entry  9  of  S*  be  such  that 

Cq  An 


9  > 


where  Cq  is  defined  in  Corollary  B.4.1.  Note  that  9  > 

Proof of Theorem 4.4.1: We condition on the event that $\|E_n\|_2 \le \delta_n$, which holds with probability greater than $1 - 2\exp\{-p\}$ from Corollary B.4.3 as $n \ge 2p$ by assumption. We note that based on the bound on $n$, we also have that

\[
\delta_n \le \xi(T)^2 \left[ \frac{\alpha\nu}{32(3-\nu)D} \min\left\{ \frac{1}{4C_1},\ \frac{\alpha\nu}{256 D (3-\nu) \psi C_2^2} \right\} \right].
\]

In particular, these bounds imply that

\[
\delta_n \le \frac{\alpha\,\xi(T)\,\nu}{32(3-\nu)D} \min\left\{ \frac{1}{4C_1},\ \frac{\alpha\,\xi(T)}{64 D \psi C_2^2} \right\} \tag{B.22}
\]

and that

\[
\delta_n \le \frac{\alpha^2 \xi(T)^2 \nu^2}{8192\, \psi C_2^2 (3-\nu)^2 D^2}. \tag{B.23}
\]

Both these weaker bounds are used later.

Based on the assumptions above, the requirements of Lemma B.4.1 on the minimum nonzero singular value of $L^*$ and the minimum magnitude nonzero entry of $S^*$ are satisfied. We only need to verify the bounds on $\lambda_n$ and $g_\gamma(\mathcal{A}^\dagger E_n)$ from Proposition B.4.4, and the bound on $g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\mathcal{A}(\Delta_S, \Delta_L)))$ from Lemma B.4.1.
First we verify the bound on $\lambda_n$. Based on the setting of $\lambda_n$ above and the bound on $\delta_n$ from (B.22), we have that

\[
\lambda_n = \frac{6D(2-\nu)\delta_n}{\xi(T)\,\nu} \le \frac{3\alpha(2-\nu)}{16(3-\nu)} \min\left\{ \frac{1}{4C_1},\ \frac{\alpha\,\xi(T)}{64 D \psi C_2^2} \right\}.
\]

Next we combine the facts that $\lambda_n = \frac{6D\delta_n(2-\nu)}{\xi(T)\,\nu}$ and that $\|E_n\|_2 \le \delta_n$ to conclude that

\[
g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{D\delta_n}{\xi(T)} = \frac{\lambda_n \nu}{6(2-\nu)}.
\]

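Finally we provide a bound on the remainder by applying Propositions B.4.2 and B.4.1, which would satisfy the last remaining condition of Lemma B.4.1. In order to apply Proposition B.4.2, we note that

\begin{align}
\frac{8}{\alpha}\left[ g_\gamma(\mathcal{A}^\dagger E_n) + g_\gamma(\mathcal{A}^\dagger \mathcal{I}^* C_{T_{\mathcal{M}}}) + \lambda_n \right]
&\le \frac{8}{\alpha}\left[ \frac{\nu}{3(2-\nu)} + 1 \right] \lambda_n \nonumber \\
&= \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)} \nonumber \\
&= \frac{32(3-\nu)D}{\alpha\,\xi(T)\,\nu}\, \delta_n \nonumber \\
&\le \min\left\{ \frac{1}{4C_1},\ \frac{\alpha\,\xi(T)}{64 D \psi C_2^2} \right\}. \tag{B.24}
\end{align}

In the first inequality we used the fact that $g_\gamma(\mathcal{A}^\dagger E_n) \le \frac{\lambda_n \nu}{6(2-\nu)}$ (from above) and that $g_\gamma(\mathcal{A}^\dagger \mathcal{I}^* C_{T_{\mathcal{M}}})$ is similarly bounded (from Corollary B.4.1 due to the bound on $\sigma$). In the second equality we used the relation $\lambda_n = \frac{6D\delta_n(2-\nu)}{\xi(T)\,\nu}$. In the final inequality we used the bound on $\delta_n$ from (B.22). This satisfies one of the requirements of Proposition B.4.2. The other condition on $\|C_{T_{\mathcal{M}}}\|_2$ is also similarly satisfied due to the bound on $\sigma$ from Corollary B.4.1. Specifically, Corollary B.4.1 bounds $\|C_{T_{\mathcal{M}}}\|_2$, and the same sequence of inequalities as above then satisfies the second requirement of Proposition B.4.2. Thus we conclude from Proposition B.4.2 and from (B.24) that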


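\[
g_\gamma(\Delta_S, \Delta_L) \le \frac{64(3-\nu)D}{\alpha\,\xi(T)\,\nu}\, \delta_n. \tag{B.25}
\]

Since $\delta_n$ scales as $\sqrt{p/n}$, this bound implies that $g_\gamma(\Delta_S, \Delta_L) \lesssim \frac{1}{\xi(T)}\sqrt{\frac{p}{n}}$, which proves the parametric consistency part of the theorem.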
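Since the bound (B.25) also satisfies the condition of Proposition B.4.1 (from the inequality following (B.24) above we see that $g_\gamma(\Delta_S, \Delta_L) \le \frac{1}{2C_1}$), we have that

\begin{align*}
g_\gamma(\mathcal{A}^\dagger R_{\Sigma_O^*}(\Delta_S + \Delta_L))
&\le \frac{2 D \psi C_2^2}{\xi(T)}\, g_\gamma(\Delta_S, \Delta_L)^2 \\
&\le \frac{2 D \psi C_2^2}{\xi(T)} \left( \frac{64(3-\nu)D}{\alpha\,\xi(T)\,\nu} \right)^2 \delta_n^2 \\
&= \left[ \frac{8192\, \psi C_2^2 (3-\nu)^2 D^2}{\alpha^2 \xi(T)^2 \nu^2}\, \delta_n \right] \frac{D\delta_n}{\xi(T)} \\
&\le \frac{D\delta_n}{\xi(T)} \\
&= \frac{\lambda_n \nu}{6(2-\nu)}.
\end{align*}

In the final inequality we used the bound (B.23) on $\delta_n$, and in the final equality we used the relation $\lambda_n = \frac{6D\delta_n(2-\nu)}{\xi(T)\,\nu}$. This concludes the algebraic consistency part of the theorem. □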
Appendix C


Proofs of Chapter 5


■ C.1 Proof of Proposition 5.3.1

Proof. First note that the Gaussian width can be upper-bounded as follows:

\[
w(C \cap \mathbb{S}^{p-1}) \le E_g\left[ \sup_{z \in C \cap B(0,1)} g^T z \right], \tag{C.1}
\]

where $B(0,1)$ denotes the unit Euclidean ball. The expression on the right hand side inside the expected value can be expressed as the optimal value of the following convex optimization problem for each $g \in \mathbb{R}^p$:

\[
\begin{array}{ll}
\max_z & g^T z \\
\text{s.t.} & z \in C \\
 & \|z\|^2 \le 1
\end{array} \tag{C.2}
\]

We now proceed to form the dual problem of (C.2) by first introducing the Lagrangian

\[
\mathcal{L}(z, u, \gamma) = g^T z + \gamma(1 - z^T z) - u^T z
\]

where $u \in C^*$ and $\gamma \ge 0$ is a scalar. To obtain the dual problem we maximize the Lagrangian with respect to $z$, which amounts to setting

\[
z = \frac{1}{2\gamma}(g - u).
\]

Plugging this into the Lagrangian above gives the dual problem

\[
\begin{array}{ll}
\min & \gamma + \frac{1}{4\gamma}\|g - u\|^2 \\
\text{s.t.} & u \in C^* \\
 & \gamma \ge 0.
\end{array}
\]

Solving this optimization with respect to $\gamma$ we find that $\gamma = \frac{1}{2}\|g - u\|$, which gives the dual problem to (C.2)

\[
\begin{array}{ll}
\min & \|g - u\| \\
\text{s.t.} & u \in C^*
\end{array} \tag{C.3}
\]

Under very mild assumptions about $C$, the optimal value of (C.3) is equal to that of (C.2) (for example as long as $C$ has a non-empty relative interior, strong duality holds). Hence we have derived

\[
E_g\left[ \sup_{z \in C \cap B(0,1)} g^T z \right] = E_g\left[ \mathrm{dist}(g, C^*) \right]. \tag{C.4}
\]

This equation combined with the bound (C.1) gives us the desired result.

□
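The duality between (C.2) and (C.3) can be checked by hand in a simple special case. The sketch below assumes $C$ is the nonnegative orthant, whose polar cone $C^*$ is the nonpositive orthant; this choice of cone and the function names are ours, made only for illustration. In this case both optimal values equal the norm of the positive part of $g$.

```python
import numpy as np

def primal_value(g):
    """Optimal value of (C.2) for the nonnegative orthant.

    The maximizer is z = g_plus / ||g_plus||, where g_plus keeps the
    positive entries of g; z = 0 is optimal if g has no positive entry.
    """
    g_plus = np.maximum(g, 0.0)
    norm = np.linalg.norm(g_plus)
    if norm == 0.0:
        return 0.0
    z = g_plus / norm
    return float(g @ z)

def dual_value(g):
    """Optimal value of (C.3): distance from g to the nonpositive orthant."""
    projection = np.minimum(g, 0.0)  # Euclidean projection onto the polar cone
    return float(np.linalg.norm(g - projection))

rng = np.random.default_rng(0)
g = rng.standard_normal(8)
gap = abs(primal_value(g) - dual_value(g))
```

Averaging `dual_value(g)` over many draws of `g` gives a Monte Carlo estimate of the right-hand side of (C.4) for this cone.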


■  C.2  Proof  of  Theorem  5.3.3 

Proof. We set $\beta = \frac{1}{\Theta}$. First note that if $\beta \ge \frac{1}{4}\exp\{\frac{p}{9}\}$ then the width bound exceeds $\sqrt{p}$, which is the maximal possible value for the width of $C$. Thus, we will assume throughout that $\beta \le \frac{1}{4}\exp\{\frac{p}{9}\}$.

Using Proposition 5.3.1 we need to upper bound the expected distance to the polar cone. Let $g \sim \mathcal{N}(0, I)$ be a normally distributed random vector. Then the norm of $g$ is independent from the angle of $g$. That is, $\|g\|$ is independent from $g/\|g\|$. Moreover, $g/\|g\|$ is distributed as a uniform sample on $\mathbb{S}^{p-1}$, and $E_g[\|g\|] \le \sqrt{p}$. Thus we have

\[
E_g[\mathrm{dist}(g, C^*)] \le E_g\left[ \|g\| \cdot \mathrm{dist}(g/\|g\|,\ C^* \cap \mathbb{S}^{p-1}) \right] \le \sqrt{p}\; E_u\left[ \mathrm{dist}(u,\ C^* \cap \mathbb{S}^{p-1}) \right] \tag{C.5}
\]

where $u$ is sampled uniformly on $\mathbb{S}^{p-1}$.

To bound the latter quantity, we will use isoperimetry. Suppose $A$ is a subset of $\mathbb{S}^{p-1}$ and $B$ is a spherical cap with the same volume as $A$. Let $N(A, r)$ denote the locus of all points in the sphere of Euclidean distance at most $r$ from the set $A$. Let $\mu$ denote the Haar measure on $\mathbb{S}^{p-1}$ and $\mu(A; r)$ denote the measure of $N(A, r)$. Then spherical isoperimetry states that $\mu(A; r) \ge \mu(B; r)$ for all $r \ge 0$ (see, for example [95, 106]).

Let $B$ now denote a spherical cap with $\mu(B) = \mu(C^* \cap \mathbb{S}^{p-1})$. Then we have

\begin{align}
E_u\left[ \mathrm{dist}(u,\ C^* \cap \mathbb{S}^{p-1}) \right] &= \int_0^\infty \mathbb{P}\left[ \mathrm{dist}(u,\ C^* \cap \mathbb{S}^{p-1}) > t \right] dt \tag{C.6} \\
&= \int_0^\infty \left( 1 - \mu(C^* \cap \mathbb{S}^{p-1}; t) \right) dt \tag{C.7} \\
&\le \int_0^\infty \left( 1 - \mu(B; t) \right) dt \tag{C.8}
\end{align}

where the first equality is the integral form of the expected value and the last inequality follows by isoperimetry. Hence we can bound the expected distance to the polar cone intersecting the sphere using only knowledge of the volume of spherical caps on $\mathbb{S}^{p-1}$.

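To proceed let $v(\varphi)$ denote the volume of a spherical cap subtending a solid angle $\varphi$. An explicit formula for $v(\varphi)$ is

\[
v(\varphi) = z_p^{-1} \int_0^{\varphi} \sin^{p-1}(\vartheta)\, d\vartheta \tag{C.9}
\]

where $z_p = \int_0^{\pi} \sin^{p-1}(\vartheta)\, d\vartheta$ [88]. Let $\varphi(\beta)$ denote the minimal solid angle of a cap such that $\beta$ copies of that cap cover $\mathbb{S}^{p-1}$. Since the geodesic distance on the sphere is always greater than or equal to Euclidean distance, if $K$ is a spherical cap subtending $\psi$ radians, $\mu(K; t) \ge v(\psi + t)$. Therefore

\[
\int_0^\infty \left( 1 - \mu(B; t) \right) dt \le \int_0^\infty \left( 1 - v(\varphi(\beta) + t) \right) dt. \tag{C.10}
\]

We can proceed to simplify the right-hand-side integral:

\begin{align}
\int_0^\infty \left( 1 - v(\varphi(\beta) + t) \right) dt
&= \int_0^{\pi - \varphi(\beta)} \left( 1 - v(\varphi(\beta) + t) \right) dt \tag{C.11} \\
&= \pi - \varphi(\beta) - \int_0^{\pi - \varphi(\beta)} v(\varphi(\beta) + t)\, dt \tag{C.12} \\
&= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi - \varphi(\beta)} \int_0^{\varphi(\beta) + t} \sin^{p-1}\vartheta \; d\vartheta\, dt \tag{C.13} \\
&= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi} \int_{\max(\vartheta - \varphi(\beta),\, 0)}^{\pi - \varphi(\beta)} \sin^{p-1}\vartheta \; dt\, d\vartheta \tag{C.14} \\
&= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi} \left\{ \pi - \varphi(\beta) - \max(\vartheta - \varphi(\beta), 0) \right\} \sin^{p-1}\vartheta \; d\vartheta \tag{C.15} \\
&= z_p^{-1} \int_0^{\pi} \max(\vartheta - \varphi(\beta), 0) \sin^{p-1}\vartheta \; d\vartheta \tag{C.16} \\
&= z_p^{-1} \int_{\varphi(\beta)}^{\pi} (\vartheta - \varphi(\beta)) \sin^{p-1}\vartheta \; d\vartheta \tag{C.17}
\end{align}

(C.14) follows by switching the order of integration and the rest of these equalities follow by straightforward integration and some algebra.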
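The chain of equalities ending in (C.17) can be sanity-checked by numerical quadrature. The sketch below evaluates the cap volume $v(\varphi) = z_p^{-1}\int_0^{\varphi}\sin^{p-1}\vartheta\, d\vartheta$ with a hand-written trapezoid rule, then compares $\int_0^{\pi-\varphi}(1 - v(\varphi + t))\, dt$ with $z_p^{-1}\int_{\varphi}^{\pi}(\vartheta - \varphi)\sin^{p-1}\vartheta\, d\vartheta$. The grid sizes and function names are choices made for this illustration.

```python
import numpy as np

def _trapezoid(y, x):
    """Composite trapezoid rule for samples y on the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def z_p(p, grid=20001):
    """Normalizing constant: integral of sin^(p-1) over [0, pi]."""
    theta = np.linspace(0.0, np.pi, grid)
    return _trapezoid(np.sin(theta) ** (p - 1), theta)

def cap_volume(phi, p, grid=20001):
    """v(phi) as in (C.9)."""
    theta = np.linspace(0.0, phi, grid)
    return _trapezoid(np.sin(theta) ** (p - 1), theta) / z_p(p, grid)

def lhs(phi, p, grid=2001):
    """Integral of 1 - v(phi + t) for t in [0, pi - phi]."""
    t = np.linspace(0.0, np.pi - phi, grid)
    vals = np.array([1.0 - cap_volume(phi + ti, p, 2001) for ti in t])
    return _trapezoid(vals, t)

def rhs(phi, p, grid=20001):
    """z_p^{-1} times the integral of (theta - phi) sin^(p-1)(theta) on [phi, pi]."""
    theta = np.linspace(phi, np.pi, grid)
    integrand = (theta - phi) * np.sin(theta) ** (p - 1)
    return _trapezoid(integrand, theta) / z_p(p, grid)
```

For example, `lhs(1.0, 5)` and `rhs(1.0, 5)` should agree up to quadrature error, and `cap_volume(np.pi / 2, 5)` should be close to one half by symmetry.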

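Using the inequalities that $z_p \ge \frac{2}{\sqrt{p-1}}$ (see [95]) and $\sin(x) \le \exp(-(x - \pi/2)^2/2)$ for $x \in [0, \pi]$, we can bound the last integral as

\[
z_p^{-1} \int_{\varphi(\beta)}^{\pi} (\vartheta - \varphi(\beta)) \sin^{p-1}\vartheta \; d\vartheta \le \frac{\sqrt{p-1}}{2} \int_{\varphi(\beta)}^{\pi} (\vartheta - \varphi(\beta)) \exp\left( -\frac{p-1}{2}\left( \vartheta - \frac{\pi}{2} \right)^2 \right) d\vartheta. \tag{C.18}
\]

Performing the change of variables $a = \sqrt{p-1}\,(\vartheta - \frac{\pi}{2})$, we are left with the integral

\begin{align}
& \frac{1}{2} \int_{\sqrt{p-1}(\varphi(\beta) - \pi/2)}^{\sqrt{p-1}\,\pi/2} \left\{ \frac{a}{\sqrt{p-1}} + \left( \frac{\pi}{2} - \varphi(\beta) \right) \right\} \exp\left( -\frac{a^2}{2} \right) da \tag{C.19} \\
&\quad = -\frac{1}{2\sqrt{p-1}} \exp\left( -\frac{a^2}{2} \right) \Bigg|_{\sqrt{p-1}(\varphi(\beta) - \pi/2)}^{\sqrt{p-1}\,\pi/2} + \frac{\frac{\pi}{2} - \varphi(\beta)}{2} \int_{\sqrt{p-1}(\varphi(\beta) - \pi/2)}^{\sqrt{p-1}\,\pi/2} \exp\left( -\frac{a^2}{2} \right) da \tag{C.20} \\
&\quad \le \frac{1}{2\sqrt{p-1}} \exp\left( -\frac{p-1}{2}\left( \frac{\pi}{2} - \varphi(\beta) \right)^2 \right) + \sqrt{\frac{\pi}{2}} \left( \frac{\pi}{2} - \varphi(\beta) \right). \tag{C.21}
\end{align}

In this final bound, we bounded the first term by dropping the contribution from the upper limit of integration, and for the second term we used the fact that

\[
\int_{-\infty}^{\infty} \exp(-x^2/2)\, dx = \sqrt{2\pi}. \tag{C.22}
\]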


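We are now left with the task of computing a lower bound for $\varphi(\beta)$. We need to first reparameterize the problem. Let $K$ be a spherical cap. Without loss of generality, we may assume that

\[
K = \{ x \in \mathbb{S}^{p-1} : x_1 \ge h \} \tag{C.23}
\]

for some $h \in [0, 1]$. Here $h$ is the height of the cap over the equator. Via elementary trigonometry, the solid angle that $K$ subtends is given by $\pi/2 - \sin^{-1}(h)$. Hence, if $h(\beta)$ is the largest number such that $\beta$ caps of height $h$ cover $\mathbb{S}^{p-1}$, then $h(\beta) = \sin(\pi/2 - \varphi(\beta))$.

The quantity $h(\beta)$ may be estimated using the following estimate from [25]. For $h \in [0, 1]$, let $\gamma(p, h)$ denote the volume of a spherical cap of $\mathbb{S}^{p-1}$ of height $h$.

Lemma C.2.1 ([25]). For $1 \ge h \ge \frac{2}{\sqrt{p}}$,

\[
\frac{1}{10 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} \le \gamma(p, h) \le \frac{1}{2 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}}. \tag{C.24}
\]

Note that for $h \ge \frac{2}{\sqrt{p}}$,

\[
\frac{1}{2 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} \le \frac{1}{4} (1 - h^2)^{\frac{p-1}{2}} \le \frac{1}{4} \exp\left( -\frac{p-1}{2} h^2 \right). \tag{C.25}
\]

So if

\[
h = \sqrt{\frac{2\log(4\beta)}{p-1}} \tag{C.26}
\]

then $h \le 1$ because we have assumed $\beta \le \frac{1}{4}\exp\{\frac{p}{9}\}$, which is at most $\frac{1}{4}\exp\{\frac{p-1}{2}\}$. Moreover, $h \ge \frac{2}{\sqrt{p}}$ and the volume of the cap with height $h$ is less than or equal to $1/\beta$. That is

\[
\varphi(\beta) \ge \frac{\pi}{2} - \sin^{-1}\left( \sqrt{\frac{2\log(4\beta)}{p-1}} \right). \tag{C.27}
\]

Combining the estimate (C.21) with Proposition 5.3.1, and using our estimate for $\varphi(\beta)$, we get the bound

\[
w(C) \le \frac{1}{2}\sqrt{\frac{p}{p-1}} \exp\left( -\frac{p-1}{2} \sin^{-1}\left( \sqrt{\frac{2\log(4\beta)}{p-1}} \right)^2 \right) + \sqrt{\frac{\pi p}{2}}\, \sin^{-1}\left( \sqrt{\frac{2\log(4\beta)}{p-1}} \right). \tag{C.28}
\]

This expression can be simplified by using the following bounds. First, $\sin^{-1}(x) \ge x$ lets us upper bound the first term by $\sqrt{\frac{p}{p-1}}\, \frac{1}{8\beta}$. For the second term, using the inequality $\sin^{-1}(x) \le \frac{\pi}{2} x$ results in the upper bound

\[
w(C) \le \sqrt{\frac{p}{p-1}} \left( \frac{1}{8\beta} + \frac{\pi^{3/2}}{2} \sqrt{\log(4\beta)} \right). \tag{C.29}
\]

For $p \ge 9$ the upper bound can be expressed simply as $w(C) \le 3\sqrt{\log(4\beta)}$. We recall that $\beta = \frac{1}{\Theta}$, which completes the proof of the theorem. □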
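To get a feel for the sizes involved, the following sketch evaluates the cap height (C.26) and the width bound (C.28), reading (C.28) as $\frac{1}{2}\sqrt{p/(p-1)}\,\exp(-\frac{p-1}{2}\sin^{-1}(h)^2) + \sqrt{\pi p/2}\,\sin^{-1}(h)$ with $h = \sqrt{2\log(4\beta)/(p-1)}$, and compares it with the simplified bound $3\sqrt{\log(4\beta)}$. The function names and the sample values of $p$ and $\beta$ are ours.

```python
import math

def cap_height(beta, p):
    """h from (C.26); requires 4 * beta >= 1 and h <= 1."""
    return math.sqrt(2.0 * math.log(4.0 * beta) / (p - 1))

def width_bound_c28(beta, p):
    """Right-hand side of (C.28) under the reading stated above."""
    a = math.asin(cap_height(beta, p))
    first = 0.5 * math.sqrt(p / (p - 1)) * math.exp(-(p - 1) / 2.0 * a * a)
    second = math.sqrt(math.pi * p / 2.0) * a
    return first + second

def width_bound_simple(beta):
    """The simplified bound 3 * sqrt(log(4 * beta))."""
    return 3.0 * math.sqrt(math.log(4.0 * beta))

p, beta = 100, 16.0
h = cap_height(beta, p)
detailed = width_bound_c28(beta, p)
simple = width_bound_simple(beta)
```

For these sample values the detailed expression is noticeably smaller than the simplified one, which reflects the slack introduced by the bound $\sin^{-1}(x) \le \frac{\pi}{2}x$.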


■  C.3  Direct  Width  Calculations 

We  first  give  the  proof  of  Proposition  5.3.2. 

Proof. Let $x^*$ be an $s$-sparse vector in $\mathbb{R}^p$ with $\ell_1$ norm equal to 1, and let $\mathcal{A}$ denote the set of unit-Euclidean-norm one-sparse vectors. Let $\Delta$ denote the set of coordinates where $x^*$ is non-zero. Recall from Chapter 2 that the normal cone at $x^*$ with respect to the $\ell_1$ ball is given by

\begin{align}
N_{\mathcal{A}}(x^*) &= \mathrm{cone}\left\{ z \in \mathbb{R}^p : z_i = \mathrm{sgn}(x_i^*) \text{ for } i \in \Delta,\ |z_i| \le 1 \text{ for } i \in \Delta^c \right\} \tag{C.30} \\
&= \left\{ z \in \mathbb{R}^p : z_i = t\,\mathrm{sgn}(x_i^*) \text{ for } i \in \Delta,\ |z_i| \le t \text{ for } i \in \Delta^c, \text{ for some } t \ge 0 \right\}. \tag{C.31}
\end{align}

Here $\Delta^c$ represents the zero entries of $x^*$.

Given $g \sim \mathcal{N}(0, I_p)$, we would like to construct a $u \in N_{\mathcal{A}}(x^*)$ that is close to $g$. Pick $u(g)$ as

\[
u_i(g) = \begin{cases} g_i & i \in \Delta^c \\ \|g_{\Delta^c}\|_\infty\, \mathrm{sgn}(x_i^*) & i \in \Delta \end{cases} \tag{C.32}
\]

That is, we set $u(g)$ equal to $g$ on $\Delta^c$. On $\Delta$, we set $u(g)$ proportional to the sign of $x^*$, and scale this sign vector appropriately by the $\ell_\infty$ norm of $g$ on $\Delta^c$. For this choice, we have

\begin{align}
E\left[ \|u(g) - g\|_2^2 \right] &= E\left[ \|u_\Delta(g) - g_\Delta\|_2^2 \right] \tag{C.33} \\
&= E\left[ \|u_\Delta(g)\|_2^2 \right] + E\left[ \|g_\Delta\|_2^2 \right] \tag{C.34} \\
&= s\, E\left[ \|g_{\Delta^c}\|_\infty^2 \right] + s \tag{C.35} \\
&\le 2s\log(p - s) + 2s \tag{C.36}
\end{align}

Here, the second equality holds because $g_{\Delta^c}$ and $g_\Delta$ are independent. The final inequality follows because the expected maximum squared magnitude of a sequence of $p - s$ standard normal random variables is bounded above by $2\log(p - s) + 1$. By Corollary 5.3.1, this means that the $\ell_1$ heuristic succeeds when $n$ exceeds $2s(\log(p - s) + 1)$.
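The inequality used in the last step, namely that the expected maximum of $m$ squared standard normals is at most $2\log m + 1$, can be examined by simulation. The sketch below is a Monte Carlo estimate under our own choice of trial count and seed; `l1_sample_bound` simply evaluates $2s(\log(p - s) + 1)$, the count obtained by factoring (C.36).

```python
import numpy as np

def mean_max_square(m, trials=2000, seed=0):
    """Monte Carlo estimate of E[max_i g_i^2] for m i.i.d. standard normals."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, m))
    return float(np.mean(np.max(g**2, axis=1)))

def l1_sample_bound(p, s):
    """Evaluate 2 s (log(p - s) + 1)."""
    return float(2.0 * s * (np.log(p - s) + 1.0))

m = 1000
estimate = mean_max_square(m)
bound = 2.0 * np.log(m) + 1.0
```

For instance, `l1_sample_bound(1000, 10)` is far below the ambient dimension 1000, which is the regime where this estimate is informative.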

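For small values of $s$, we can tighten this result. The minimum squared distance to the normal cone at $x^*$ can be formulated as a one-dimensional convex optimization problem for arbitrary $z \in \mathbb{R}^p$:

\begin{align}
\inf_{u \in N_{\mathcal{A}}(x^*)} \|z - u\|_2^2 &= \inf_{\substack{t \ge 0 \\ |u_j| \le t,\ j \in \Delta^c}} \sum_{i \in \Delta} (z_i - t\,\mathrm{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} (z_j - u_j)^2 \tag{C.37} \\
&= \inf_{t \ge 0} \sum_{i \in \Delta} (z_i - t\,\mathrm{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} \mathrm{shrink}(z_j, t)^2 \tag{C.38}
\end{align}

where

\[
\mathrm{shrink}(z, t) = \begin{cases} z + t & z < -t \\ 0 & -t \le z \le t \\ z - t & z > t \end{cases} \tag{C.39}
\]

is the $\ell_1$-shrinkage function. Hence, for any fixed $t \ge 0$ independent of $g$, we have

\begin{align}
E\left[ \inf_{u \in N_{\mathcal{A}}(x^*)} \|g - u\|_2^2 \right] &\le E\left[ \sum_{i \in \Delta} (g_i - t\,\mathrm{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} \mathrm{shrink}(g_j, t)^2 \right] \tag{C.40} \\
&= s(1 + t^2) + E\left[ \sum_{j \in \Delta^c} \mathrm{shrink}(g_j, t)^2 \right]. \tag{C.41}
\end{align}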
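The one-dimensional reformulation (C.38) is easy to evaluate numerically. The sketch below implements the shrinkage function (C.39) and minimizes the objective of (C.38) over $t \ge 0$ by ternary search, which is valid because the objective is convex in $t$. The search interval $[0, \max_i |z_i|]$, the iteration count, and the function names are choices made for this illustration.

```python
import numpy as np

def shrink(z, t):
    """Soft-thresholding: z + t if z < -t, 0 if |z| <= t, z - t if z > t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(z, support, signs, t):
    """Objective of (C.38) at a given t, for support indices and their signs."""
    off_support = np.setdiff1d(np.arange(z.size), support)
    on_term = np.sum((z[support] - t * signs) ** 2)
    off_term = np.sum(shrink(z[off_support], t) ** 2)
    return float(on_term + off_term)

def sq_dist_to_normal_cone(z, support, signs, iters=200):
    """Minimize the convex objective over t >= 0 by ternary search."""
    lo, hi = 0.0, float(np.max(np.abs(z)))  # the objective is nondecreasing beyond hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(z, support, signs, m1) <= objective(z, support, signs, m2):
            hi = m2
        else:
            lo = m1
    t = 0.5 * (lo + hi)
    return objective(z, support, signs, t), t

support = np.array([0, 1])
signs = np.array([1.0, -1.0])
z_in_cone = np.array([2.0, -2.0, 0.5, -1.5, 2.0])  # lies in the cone with t = 2
dist_sq, t_star = sq_dist_to_normal_cone(z_in_cone, support, signs)
```

A point that already lies in the normal cone, such as `z_in_cone` above, should return a squared distance of essentially zero together with its scale parameter.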


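Now we directly integrate the second term, treating each summand individually. For a zero-mean, unit-variance normal random variable $g$,

\begin{align}
E\left[ \mathrm{shrink}(g, t)^2 \right] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-t} (g + t)^2 \exp(-g^2/2)\, dg + \frac{1}{\sqrt{2\pi}} \int_{t}^{\infty} (g - t)^2 \exp(-g^2/2)\, dg \tag{C.42} \\
&= \frac{2}{\sqrt{2\pi}} \int_{t}^{\infty} (g - t)^2 \exp(-g^2/2)\, dg \tag{C.43} \\
&= -\frac{2}{\sqrt{2\pi}}\, t \exp(-t^2/2) + \frac{2(1 + t^2)}{\sqrt{2\pi}} \int_{t}^{\infty} \exp(-g^2/2)\, dg \tag{C.44} \\
&\le \frac{2}{\sqrt{2\pi}} \left( -t + \frac{1 + t^2}{t} \right) \exp(-t^2/2) \tag{C.45} \\
&= \frac{2}{\sqrt{2\pi}}\, \frac{1}{t} \exp(-t^2/2). \tag{C.46}
\end{align}

The first simplification follows because the shrink function and Gaussian distributions are symmetric about the origin. The second equality follows by integrating by parts. The inequality follows by a tight bound on the Gaussian Q-function

\[
Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} \exp(-g^2/2)\, dg \le \frac{1}{\sqrt{2\pi}}\, \frac{1}{x} \exp(-x^2/2) \quad \text{for } x > 0. \tag{C.47}
\]

Using this bound, we get

\[
E\left[ \inf_{u \in N_{\mathcal{A}}(x^*)} \|g - u\|_2^2 \right] \le s(1 + t^2) + (p - s)\, \frac{2}{\sqrt{2\pi}}\, \frac{1}{t} \exp(-t^2/2). \tag{C.48}
\]

Setting $t = \sqrt{2\log(p/s - 1) - 1}$ gives

\[
E\left[ \inf_{z \in N_{\mathcal{A}}(x^*)} \|g - z\|_2^2 \right] \le 2s\left[ \log\left( \frac{p - s}{s} \right) + 1 \right] \tag{C.49}
\]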


provided  that  s  <  This  bound  on  s  arises  because  t  must  be  greater  than  or  equal 

to  0  and  the  second  term  in  (C.48)  is  set  to  be  less  than  2s. 

□ 
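The two analytic ingredients of this argument can be checked directly with the standard library. The sketch below evaluates the Gaussian Q-function through `math.erfc`, the upper bound $\frac{1}{\sqrt{2\pi}}\frac{1}{x}\exp(-x^2/2)$, the closed form $E[\mathrm{shrink}(g,t)^2] = -\sqrt{2/\pi}\, t \exp(-t^2/2) + 2(1 + t^2) Q(t)$ obtained by integrating by parts, and the resulting bound $\sqrt{2/\pi}\, \frac{1}{t} \exp(-t^2/2)$. The function names are ours.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P[g > x] for a standard normal g."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_upper_bound(x):
    """The bound exp(-x^2 / 2) / (x * sqrt(2 * pi)), valid for x > 0."""
    return math.exp(-x * x / 2.0) / (x * math.sqrt(2.0 * math.pi))

def expected_shrink_sq(t):
    """Closed form of E[shrink(g, t)^2] after integrating by parts."""
    return (-math.sqrt(2.0 / math.pi) * t * math.exp(-t * t / 2.0)
            + 2.0 * (1.0 + t * t) * q_function(t))

def expected_shrink_sq_bound(t):
    """Upper bound sqrt(2 / pi) * exp(-t^2 / 2) / t, valid for t > 0."""
    return math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0) / t

checks = [(t, expected_shrink_sq(t), expected_shrink_sq_bound(t))
          for t in (0.1, 0.5, 1.0, 2.0, 4.0)]
```

At $t = 0$ the closed form reduces to $E[g^2] = 1$, since no shrinkage is applied.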
Next we give the proof of Proposition 5.3.3.

Proof. Let $x^*$ be an $m_1 \times m_2$ matrix of rank $r$ with singular value decomposition $U \Sigma V^T$, and let $\mathcal{A}$ denote the set of rank-one unit-Euclidean-norm matrices of size $m_1 \times m_2$. Without loss of generality, impose the conventions $m_1 \le m_2$, $\Sigma$ is $r \times r$, $U$ is $m_1 \times r$, $V$ is $m_2 \times r$, and assume the nuclear norm of $x^*$ is equal to 1.

Let $u_k$ (respectively $v_k$) denote the $k$'th column of $U$ (respectively $V$). It is convenient to introduce the orthogonal decomposition $\mathbb{R}^{m_1 \times m_2} = \Delta \oplus \Delta^\perp$ where $\Delta$ is the linear space spanned by elements of the form $u_k z^T$ and $y v_k^T$, $1 \le k \le r$, where $z$ and $y$ are arbitrary, and $\Delta^\perp$ is the orthogonal complement of $\Delta$. The space $\Delta^\perp$ is the subspace of matrices spanned by the family $(y z^T)$, where $y$ (respectively $z$) is any vector orthogonal to all the columns of $U$ (respectively $V$). Recall from Chapter 2 that the normal cone of the nuclear norm ball at $x^*$ is given by the cone generated by the subdifferential at $x^*$:

\begin{align}
N_{\mathcal{A}}(x^*) &= \mathrm{cone}\left\{ U V^T + W \in \mathbb{R}^{m_1 \times m_2} : W^T U = 0,\ W V = 0,\ \|W\|_{\mathcal{A}}^* \le 1 \right\} \tag{C.50} \\
&= \left\{ t\, U V^T + W \in \mathbb{R}^{m_1 \times m_2} : W^T U = 0,\ W V = 0,\ \|W\|_{\mathcal{A}}^* \le t,\ t \ge 0 \right\}. \tag{C.51}
\end{align}

Note that here $\|Z\|_{\mathcal{A}}^*$ is the operator norm, equal to the maximum singular value of $Z$ [121].

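Let $G$ be a Gaussian random matrix with i.i.d. entries, each with mean zero and unit variance. Then the matrix

\[
Z(G) = \|P_{\Delta^\perp}(G)\|\, U V^T + P_{\Delta^\perp}(G) \tag{C.52}
\]

is in the normal cone at $x^*$. We can then compute

\begin{align}
E\left[ \|G - Z(G)\|_F^2 \right] &= E\left[ \|P_\Delta(G) - P_\Delta(Z(G))\|_F^2 \right] \tag{C.53} \\
&= E\left[ \|P_\Delta(G)\|_F^2 \right] + E\left[ \|P_\Delta(Z(G))\|_F^2 \right] \tag{C.54} \\
&= r(m_1 + m_2 - r) + r\, E\left[ \|P_{\Delta^\perp}(G)\|^2 \right]. \tag{C.55}
\end{align}

Here (C.54) follows because $P_\Delta(G)$ and $P_{\Delta^\perp}(G)$ are independent. The final line follows because $\dim(\Delta) = r(m_1 + m_2 - r)$ and the Frobenius (i.e., Euclidean) norm of $U V^T$ is $\|U V^T\|_F = \sqrt{r}$. Due to the isotropy of Gaussian random matrices, $P_{\Delta^\perp}(G)$ is identically distributed as an $(m_1 - r) \times (m_2 - r)$ matrix with i.i.d. Gaussian entries each with mean zero and variance one. We thus know that

\[
\mathbb{P}\left[ \|P_{\Delta^\perp}(G)\| \ge \sqrt{m_1 - r} + \sqrt{m_2 - r} + s \right] \le \exp(-s^2/2) \tag{C.56}
\]

(see, for example, [41]). To bound the latter expectation, we again use the integral form of the expected value. Letting $\mu_{\Delta^\perp}$ denote the quantity $\sqrt{m_1 - r} + \sqrt{m_2 - r}$, we have

\begin{align}
E\left[ \|P_{\Delta^\perp}(G)\|^2 \right] &= \int_0^\infty \mathbb{P}\left[ \|P_{\Delta^\perp}(G)\|^2 > h \right] dh \tag{C.57} \\
&\le \mu_{\Delta^\perp}^2 + \int_{\mu_{\Delta^\perp}^2}^\infty \mathbb{P}\left[ \|P_{\Delta^\perp}(G)\|^2 > h \right] dh \tag{C.58} \\
&\le \mu_{\Delta^\perp}^2 + \int_0^\infty \mathbb{P}\left[ \|P_{\Delta^\perp}(G)\|^2 > \mu_{\Delta^\perp}^2 + t \right] dt \tag{C.59} \\
&\le \mu_{\Delta^\perp}^2 + \int_0^\infty \mathbb{P}\left[ \|P_{\Delta^\perp}(G)\| > \mu_{\Delta^\perp} + \sqrt{t} \right] dt \tag{C.60} \\
&\le \mu_{\Delta^\perp}^2 + \int_0^\infty \exp(-t/2)\, dt \tag{C.61} \\
&= \mu_{\Delta^\perp}^2 + 2 \tag{C.62}
\end{align}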


Using this bound in (C.55), we get that

\begin{align}
E\left[ \inf_{Z \in N_{\mathcal{A}}(x^*)} \|G - Z\|_F^2 \right] &\le r(m_1 + m_2 - r) + r\left( \sqrt{m_1 - r} + \sqrt{m_2 - r} \right)^2 + 2r \tag{C.63} \\
&\le r(m_1 + m_2 - r) + 2r(m_1 + m_2 - 2r) + 2r \tag{C.64} \\
&\le 3r(m_1 + m_2 - r) \tag{C.65}
\end{align}

where the second inequality follows from the fact that $(a + b)^2 \le 2a^2 + 2b^2$. We conclude that $3r(m_1 + m_2 - r)$ random measurements are sufficient to recover a rank $r$, $m_1 \times m_2$ matrix using the nuclear norm heuristic. □
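The three successive bounds (C.63) to (C.65) and the estimate $E[\|P_{\Delta^\perp}(G)\|^2] \le (\sqrt{m_1 - r} + \sqrt{m_2 - r})^2 + 2$ can be tabulated for concrete dimensions. The sketch below does so, and also estimates the mean squared operator norm of a Gaussian matrix by simulation; the dimensions, trial count, seed, and function names are choices made for this illustration.

```python
import numpy as np

def nuclear_norm_bounds(m1, m2, r):
    """Right-hand sides of (C.63), (C.64), and (C.65), in that order."""
    c63 = r * (m1 + m2 - r) + r * (np.sqrt(m1 - r) + np.sqrt(m2 - r)) ** 2 + 2 * r
    c64 = r * (m1 + m2 - r) + 2 * r * (m1 + m2 - 2 * r) + 2 * r
    c65 = 3 * r * (m1 + m2 - r)
    return float(c63), float(c64), float(c65)

def mean_sq_operator_norm(rows, cols, trials=200, seed=0):
    """Monte Carlo estimate of E[||G||^2] for a rows x cols standard Gaussian G."""
    rng = np.random.default_rng(seed)
    norms = [np.linalg.norm(rng.standard_normal((rows, cols)), ord=2)
             for _ in range(trials)]
    return float(np.mean(np.square(norms)))

m1, m2, r = 30, 40, 5
c63, c64, c65 = nuclear_norm_bounds(m1, m2, r)
mu = np.sqrt(m1 - r) + np.sqrt(m2 - r)
simulated = mean_sq_operator_norm(m1 - r, m2 - r)
```

For these dimensions the final count $3r(m_1 + m_2 - r)$ is well below the $m_1 m_2$ entries of the matrix.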
Appendix D


Properties of Convex Symmetric Functions


A convex symmetric function is a convex function that is invariant with respect to a permutation of the argument:

Definition D.0.1. A function $g : \mathbb{R}^n \to \mathbb{R}$ is a convex symmetric function if it is convex, and if for any $x \in \mathbb{R}^n$ it holds that $g(\Pi x) = g(x)$ for all permutation matrices $\Pi \in \mathrm{Sym}(n)$.

The  properties  of  such  functions  are  well-known  in  the  literature  on  convex  analysis 
and  optimization,  and  they  arise  in  many  applications.  We  briefly  describe  some  of 
these  properties  and  applications  here. 

An important class of convex symmetric functions is the set of monotone linear functionals:

\[
g(x) = v^T \bar{x},
\]

where $v_1 \ge \cdots \ge v_n$. Recall that $\bar{x}$ is the vector obtained by sorting the entries of $x$ in descending order. Monotone linear functionals can be used to express any convex symmetric function. Specifically, let $\mathcal{M} \subset \mathbb{R}^n$ represent the cone of monotone decreasing vectors in $\mathbb{R}^n$. Then for any convex symmetric function $g : \mathbb{R}^n \to \mathbb{R}$, we have that

\[
g(x) = \sup_{v \in \mathcal{M}}\ v^T \bar{x} - \alpha_v.
\]

This statement is a simple consequence of the separation theorem from convex analysis [124]. Monotone linear functionals in turn can be expressed as the nonnegative sum of even more elementary functions called distribution functions, which are defined as follows:

\[
g_k(x) = \sum_{i=1}^{k} (\bar{x})_i.
\]

These functions are closely related to the notion of conditional value-at-risk [125], which in turn is computed using quantiles of probability distributions.
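The decomposition of a monotone linear functional into distribution functions is an instance of summation by parts: $v^T \bar{x} = \sum_{k=1}^{n} (v_k - v_{k+1})\, g_k(x)$ with the convention $v_{n+1} = 0$. The sketch below checks this identity numerically; the function names are ours. Note that the weights $v_k - v_{k+1}$ are nonnegative for $k < n$, while the last weight equals $v_n$ and multiplies $g_n$, which is simply the sum of all entries.

```python
import numpy as np

def sort_desc(x):
    """Return the entries of x sorted in descending order."""
    return np.sort(np.asarray(x, dtype=float))[::-1]

def distribution_function(x, k):
    """g_k(x): the sum of the k largest entries of x."""
    return float(np.sum(sort_desc(x)[:k]))

def monotone_functional(v, x):
    """v^T x_bar for a vector v with v_1 >= ... >= v_n."""
    v = np.asarray(v, dtype=float)
    if np.any(np.diff(v) > 0):
        raise ValueError("v must be sorted in descending order")
    return float(v @ sort_desc(x))

def via_distribution_functions(v, x):
    """Summation by parts: sum over k of (v_k - v_{k+1}) g_k(x), with v_{n+1} = 0."""
    v = np.asarray(v, dtype=float)
    weights = v - np.append(v[1:], 0.0)
    return float(sum(w * distribution_function(x, k + 1)
                     for k, w in enumerate(weights)))

v = [3.0, 2.0, 2.0, 0.5]
x = [1.0, -4.0, 2.5, 0.0]
direct = monotone_functional(v, x)
decomposed = via_distribution_functions(v, x)
```

Both evaluations should coincide for any input, and permuting the entries of `x` leaves the value unchanged.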

Convex symmetric functions are intimately connected with the concept of majorization [104]. There are many equivalent characterizations of majorization [42, 97], and we briefly mention some of these next. A vector $x \in \mathbb{R}^n$ is said to majorize another vector $y \in \mathbb{R}^n$ if

\[
g_k(x) \ge g_k(y),\ \forall k = 1, \ldots, n - 1 \quad \text{and} \quad g_n(x) = g_n(y).
\]

The permutahedron of a vector $x \in \mathbb{R}^n$ is the convex hull of all permutations of $x$, and is given by the set of vectors in $\mathbb{R}^n$ that are majorized by $x$. Thus, convex constraints given by distribution functions provide a simple characterization of the permutahedron generated by $x$. Majorization is also closely related to the notion of Lorenz dominance; a (typically nonnegative) vector $x \in \mathbb{R}^n$ is said to Lorenz-dominate $y \in \mathbb{R}^n$ if $-x$ is majorized by $-y$. Lorenz dominance is used to measure the level of inequality in distributions, i.e., if a distribution $x$ Lorenz-dominates a distribution $y$ then $x$ is "more equal" than $y$ (see also the Gini coefficient, which is used to measure inequalities in countries).
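The majorization test stated above translates directly into a comparison of cumulative sums of the sorted vectors. The following sketch implements it, along with the Lorenz-dominance relation as defined in the text; the function names and the numerical tolerance are choices made for this illustration.

```python
import numpy as np

def majorizes(x, y, tol=1e-12):
    """True if x majorizes y: g_k(x) >= g_k(y) for k < n and g_n(x) = g_n(y)."""
    xs = np.sort(np.asarray(x, dtype=float))[::-1]
    ys = np.sort(np.asarray(y, dtype=float))[::-1]
    if xs.shape != ys.shape:
        raise ValueError("x and y must have the same length")
    cx, cy = np.cumsum(xs), np.cumsum(ys)  # cx[k-1] equals g_k(x)
    partial_ok = bool(np.all(cx[:-1] >= cy[:-1] - tol))
    total_ok = bool(abs(cx[-1] - cy[-1]) <= tol)
    return partial_ok and total_ok

def lorenz_dominates(x, y):
    """x Lorenz-dominates y if -x is majorized by -y."""
    return majorizes(-np.asarray(y, dtype=float), -np.asarray(x, dtype=float))

concentrated = [3.0, 0.0, 0.0]
uniform = [1.0, 1.0, 1.0]
```

Here `concentrated` majorizes `uniform`, and correspondingly `uniform` Lorenz-dominates `concentrated`, matching the reading that a Lorenz-dominating distribution is the more equal one. A convex symmetric function such as the sum of squares takes a value on `concentrated` that is at least its value on `uniform`.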

A convex symmetric function is an example of a Schur-convex function, which is a function $f$ such that $f(x) \ge f(y)$ whenever $x$ majorizes $y$. Hence a Schur-convex function preserves order with respect to majorization. Consequently, such functions arise in many applications in which majorization plays a prominent role [104]. We note that the functions that are both convex and Schur-convex are exactly the convex symmetric functions.

A fairly similar set of results holds for convex functions of symmetric matrices that are invariant under conjugation of the argument by orthogonal matrices, i.e., convex functions $f : \mathbb{S}^n \to \mathbb{R}$ such that $f(V A V^T) = f(A)$ for all $A \in \mathbb{S}^n$ and for all orthogonal matrices $V \in \mathbb{R}^{n \times n}$.